METHODS AND APPARATUS FOR PARALLEL IMPLEMENTATIONS OF TABLE
LOOK-UPS AND CIPHERING

Field of the Invention

The invention relates to a method and apparatus for
parallel implementations of table look-ups. For example, the
invention relates to a parallel implementation of table
look-ups in the context of a Kasumi algorithm for ciphering
(encryption) in communications networks.

Background of the Invention 

In networks, for example a UMTS (Universal Mobile
Telecommunications System) network, a Kasumi ciphering
algorithm has been used for ciphering, which is also known as
encryption. In particular, data being transmitted is ciphered
for transmission. Referring to Figure 1, shown is a block
diagram of a ciphering block 100 operating on input data 140
being transmitted at, for example, an RNC (Radio Network
Controller) in a UMTS network (not shown). The ciphering block
100 implements a Kasumi ciphering algorithm that produces a
64-bit output 130 from a 64-bit input 110 under the control of a
128-bit key 120. The input data 140 undergoes an exclusive-OR
operation 150 using the output 130 from the ciphering block 100,
resulting in ciphered data 160. In particular, the Kasumi
algorithm is a Feistel cipher as shown in Figures 2A to 2D with
eight rounds in which a number of functions are evaluated at
each of the eight rounds. The functions of each of the eight
rounds are described in detail in a document entitled "KASUMI
Specification" available at
http://www.3gpp.org/TB/other/algorithms/35202-311.pdf, which is
incorporated herein by reference. In particular, at each of
the eight rounds two of the functions, referred to as an S7
function and an S9 function, are each evaluated 6 times. The S7
function maps a 7-bit input X defined by bits x_i (i = 0 to 6)
to a 7-bit output Y defined by bits y_j (j = 0 to 6). The S9
function maps a 9-bit input X' defined by bits x'_k (k = 0 to 8)
to a 9-bit output Y' defined by bits y'_l (l = 0 to 8).

For the S7 function, the output Y is a function of X.
Equivalently, each bit y_j is a function of the bits x_i as given
by Equations 200, 201, 202, 203, 204, 205, 206 shown in Figure
3. In Equations 200, 201, 202, 203, 204, 205, 206, x_m x_n (m, n =
0 to 6) is written as a short form for x_m ∩ x_n where ∩ is an
AND operator. Similarly, in Equations 200, 201, 202, 203, 204,
205, 206, x_m x_n x_o (o = 0 to 6) is written as a short form for
x_m ∩ x_n ∩ x_o. Finally, in Equations 200, 201, 202, 203, 204,
205, 206, ⊕ is an exclusive-OR operator.

For the S9 function, the output Y' is a function of
X'. Equivalently, each of the bits y'_l is a function of the
bits x'_k as given by Equations 300, 301, 302, 303, 304, 305,
306, 307, 308 shown in Figure 4. In Equations 300, 301, 302,
303, 304, 305, 306, 307, 308, x'_p x'_q (p, q = 0 to 8) is written
as a short form for x'_p ∩ x'_q. Similarly, in Equations 300,
301, 302, 303, 304, 305, 306, 307, 308, x'_p x'_q x'_r (r = 0 to 8)
is written as a short form for x'_p ∩ x'_q ∩ x'_r.

The Kasumi algorithm, including evaluation of the S7
and S9 functions, has not been implemented in parallel for
multiple inputs. Since most of the computing in the Kasumi
algorithm involves evaluating the S7 and S9 functions, the
non-parallel implementation for evaluating these functions
imposes considerable limitations in efficiency.

Some non-parallel implementations have been developed
using software written in assembly language; however, CPU
(Central Processing Unit) resources required by the Kasumi
algorithm are still limiting.

Summary of the Invention 

A method and apparatus are used to generate outputs
according to a ciphering algorithm which for each of the
outputs operates on a respective input using a respective key.
The ciphering algorithm has a plurality of rounds in which
functions are evaluated. For at least one of the functions,
outputs are generated by looking up at least one look-up table
with each look-up table being looked-up in parallel using
respective inputs. Different methods for parallel table
look-ups are provided. The methods allow the ciphering algorithm
to be implemented partially or entirely in parallel.

One parallel implementation involves the Kasumi
algorithm in which S7 and S9 functions are evaluated in
parallel for a plurality of inputs using vector instructions on
an SIMD (Single Instruction Multiple Data) architecture. In
some implementations, the methods of looking up look-up tables
make use of look-up tables which can be pre-loaded in their
entirety into vectors. For example, in one implementation a
PowerPC is employed having an Altivec co-processor having 32
vectors each capable of holding a number of elements. A method
provides a parallel implementation of the Kasumi algorithm in
which the S7 and S9 functions are each looked up in parallel
for a plurality of inputs. The method employs look-up tables
for the S7 and S9 functions which are pre-loaded in their
entirety into the 32 vectors for look-ups using vector
instructions. Such a parallel implementation provides
processing that is approximately 6 to 8 times faster than
existing non-parallel Kasumi implementations.

According to a broad aspect, the invention provides a
method in which there is a plurality of inputs, each input
being defined by a first set of bits and a second set of one or
more bits. For each input of the plurality of inputs and in
parallel with other inputs of the plurality of inputs, the
method involves, for each of a plurality of look-up tables each
having a plurality of elements, looking-up one of the plurality
of elements of the look-up table using the first set of bits
that define the input to obtain an output. The outputs from
the plurality of look-up tables collectively form a set
of corresponding outputs. For each input and in parallel with
the other inputs, a corresponding output from the set of
corresponding outputs is then selected using the second set of
one or more bits that define the input.

According to another broad aspect, the invention
provides an apparatus having a processor and a memory adapted
to store a plurality of elements of each of a plurality of
look-up tables. The processor receives a plurality of inputs,
each input being defined by a first set of bits and a second
set of one or more bits. For each input of the plurality of
inputs and in parallel with other inputs of the plurality of
inputs the processor is adapted to: for each of the plurality
of look-up tables, look-up one of the plurality of elements of
the look-up table using the first set of bits that define the
input to obtain an output. For each input, the outputs from
the plurality of look-up tables collectively form a set
of corresponding outputs. For each input and in parallel with
the other inputs the processor is also adapted to select a
corresponding output from the set of corresponding outputs
using the second set of one or more bits that define the input.

According to another broad aspect, the invention
provides a method in which there is a plurality of inputs each
defined by a first plurality of bits. For each input of the
plurality of inputs and in parallel with other inputs of the
plurality of inputs, the method involves for each of a
plurality of look-up tables each having a plurality of
elements: (i) selecting a respective subset of bits of the
first plurality of bits that define the input, the respective
subset of bits having fewer bits than the first
plurality of bits of the input; and (ii) looking-up an element
of the plurality of elements of the look-up table using the
subset of bits to obtain an output. For each input and in
parallel with the other inputs, the method also involves
combining the outputs obtained from the plurality of look-up
tables to obtain at least one bit.

According to another broad aspect, the invention 
provides an apparatus having a processor and a memory adapted 
to store a plurality of elements of each of a plurality of 
look-up tables. There is a plurality of inputs each defined by
a first plurality of bits. For each input of the plurality of 
inputs and in parallel with other inputs of the plurality of 
inputs, the processor is adapted to for each look-up table: (i) 
select a respective subset of bits of the first plurality of 
bits that define the input, the bits of the respective subset 
of bits having fewer bits than the first plurality of bits of 
the input; and (ii) look-up an element of the plurality of 
elements of the look-up table using the subset of bits to 
obtain an output. For each input and in parallel with the 
other inputs the processor is also adapted to combine the 
outputs obtained from the plurality of look-up tables to obtain 
at least one bit. 

According to another broad aspect, the invention
provides a method which in response to N K_in-bit inputs performs
bit permutation/reordering on the N K_in-bit inputs to produce M
parallel sets of outputs wherein N and K_in are integers
satisfying N, K_in > 2. An ith set of outputs of the M parallel
sets of outputs contains N sets of bits L_i,in bits in length with
i and L_i,in being integers satisfying i = 1 to M and
1 ≤ L_i,in < K_in. The ith set of outputs defines a respective
subset of the bits of the inputs. For each parallel set of outputs, a
parallel look-up table operation is performed to generate a
corresponding parallel set of outputs containing N outputs,
each being associated with a respective one of the N K_in-bit
inputs and each being L_i,out bits in length. L_i,out is an integer
satisfying L_i,out ≥ 1. For each of the N K_in-bit inputs, a
respective output is generated by performing a bit combining
operation on the outputs from the parallel look-up table
operations associated with the input.

According to another broad aspect, the invention 
provides a method of generating a plurality of outputs 
according to a ciphering algorithm which for each of the 
plurality of outputs operates on a respective input using a 
respective key. The ciphering algorithm has a plurality of 
rounds in which functions are evaluated. For at least one
function of the functions of at least one of the plurality of
rounds there is a plurality of first inputs each being
associated with one of the respective inputs. For each first
input and in parallel with other first inputs of the plurality
of first inputs, the method involves generating an output by
looking up at least one look-up table using the first input, each
look-up table having a plurality of elements.

In some embodiments of the invention, the ciphering 
algorithm is a Kasumi algorithm. 

According to another broad aspect, the invention
provides an apparatus for generating a plurality of outputs
according to a ciphering algorithm which for each of the
plurality of outputs operates on a respective input using a
respective key. The ciphering algorithm has a plurality of
rounds in which functions are evaluated. The apparatus has a
processor and a memory adapted to store a plurality of elements
of each of at least one look-up table. For at least one
function of the functions of at least one of the plurality of
rounds, the processor is adapted to: responsive to a plurality
of first inputs each being associated with one of the
respective inputs, for each first input and in parallel with
other first inputs of the plurality of first inputs, generate an
output by looking up at least one look-up table using the
first input, each look-up table having a plurality of elements.

In some embodiments of the invention, the ciphering
algorithm is a Kasumi algorithm.

According to another broad aspect, the invention
provides a method for which there is a plurality of inputs,
each input being defined by one or more bits. For each input
of the plurality of inputs and in parallel with other inputs of
the plurality of inputs the method involves looking-up a
look-up table having a plurality of elements using the one or more
bits that define the input to obtain an output.

According to another broad aspect, the invention
provides an apparatus having a processor and a memory adapted
to store a plurality of elements of a look-up table. There is
a plurality of inputs, each input being defined by one or more
bits. For each input of the plurality of inputs and in parallel
with other inputs of the plurality of inputs the processor is
adapted to look-up the look-up table using the one or more bits
that define the input to obtain an output.

Preferred embodiments of the invention will now be
described with reference to the attached drawings in which:

Figure 1 is a block diagram of a ciphering block
operating on input data being transmitted at, for example, an RNC
(Radio Network Controller) in a UMTS (Universal Mobile
Telecommunications System) network;

Figure 2A is a flow chart for the Kasumi algorithm; 

Figure 2B is a flow chart of an FO function evaluated
at each terminal of the Kasumi algorithm of Figure 2A;

Figure 2C is a flow chart of an FI function evaluated
for the FO function of Figure 2B;

Figure 2D is a flow chart of an FL function evaluated
for the FI function of Figure 2A;

Figure 3 is a list of Equations for an S7 function of
a Kasumi algorithm;

Figure 4 is a list of Equations for an S9 function of
the Kasumi algorithm;

Figure 5 is a flow chart of a method of performing
parallel look-ups using tables, according to an embodiment of
the invention;

Figure 6 is a flow diagram of elements being looked 
up in look-up tables and selected according to the method of 
Figure 5 as applied to an S7 function; 

Figure 7 is a block diagram of vectors being operated
on during a vperm (vector permutation) instruction;

Figure 8 is a flow chart of a method of performing a
step in the method of Figure 5;

Figure 9 is a flow chart of a method of selecting an
output from two other outputs in method steps of Figure 8;

Figure 10 is a block diagram of a vector being
operated on during a vsel (vector select) instruction used in a
method step of Figure 9;

Figure 11 is a flow chart of a method of performing 
parallel look-ups using tables, according to another embodiment 
of the invention; 

Figure 12 is a table listing into groups components
x'_p x'_q of Equations of Figure 4 that are to undergo an
exclusive-OR operation, in accordance with another embodiment of the
invention;

Figure 13 is a table listing, for each group of
Figure 12, input bits used as indices into look-up tables and
output bits returned by the look-up tables;

Figure 14 is a table listing for each group, ordering 
of the input bits listed in Figure 13; 

Figure 15A is a block diagram of a vector being
operated on during a vsrb (vector shift right byte) instruction
used in method steps of Figure 11;

Figure 15B is a block diagram of vectors being 
operated on during a vsel instruction used in method steps of 
Figure 11; 

Figure 15C is a block diagram of a vector being
operated on during a vrlb (vector rotate left byte) instruction
used in method steps of Figure 11;

Figure 15D is a block diagram of vectors being
operated on during a vsel instruction used in method steps of
Figure 11;

Figure 15E is a block diagram of vectors being
operated on during vslb (vector shift left byte) and vsel
instructions used in method steps of Figure 11;

Figure 15F is a block diagram of vectors being 
operated on during vsrb and vsel instructions used in method 
steps of Figure 11; 






Figure 16 is a block diagram of vectors being 
operated on during a vperm instruction used in method steps of 
Figure 11; 

Figure 17 is a flow chart of a method of combining
outputs obtained in a step of Figure 11; 

Figure 18 is a flow diagram showing how vectors 
containing outputs are combined by being operated on using 
exclusive-OR and bit manipulation operations; 

Figure 19A is a block diagram of an apparatus for 
implementing the methods of Figures 5 and 11; 

Figure 19B is a block diagram of the apparatus of
Figure 19A implemented as a ciphering block; and 

Figure 20 is an operation flow diagram of an example 
implementation of a method of looking up tables in parallel. 

Detailed Description of the Preferred Embodiments 

In a ciphering algorithm an input is operated on
using a key to generate an output. Input data is then combined
with the output to produce ciphered data. In the ciphering
algorithm there are a plurality of rounds in which functions
are evaluated. Some of these functions cannot be implemented
in a simple manner for parallel computation on a number of
inputs to generate a number of outputs in parallel. In some
embodiments of the invention a method of generating a plurality
of outputs according to such ciphering algorithms is
implemented at least partially in parallel for a number of
inputs and keys. In some embodiments of the invention, the
ciphering algorithm is implemented entirely in parallel.
Furthermore, in some embodiments of the invention the outputs
obtained are combined, in parallel, with input data to generate
ciphered data using, for example, exclusive-OR operations
implemented in parallel.

A parallel implementation of a Kasumi algorithm will
be described as an illustrative example; however, it is to be
clearly understood that the invention is not limited to a
parallel implementation of the Kasumi algorithm and in other
embodiments of the invention other ciphering algorithms are
implemented in parallel. In order to describe a parallel
implementation of the Kasumi algorithm, it is worthwhile to
first look at the Kasumi algorithm with reference to Figures 2A
to 2D. The Kasumi algorithm has eight rounds 2000 of
computations and at each round 2000 a number of functions are
performed including FO_i and FL_i (i = 1 to 8) functions, FI_i,g
(g = 1 to 3) functions, S7 and S9 functions, exclusive-OR
operations shown as ⊕, zero-extend operations, truncate
operations, bitwise AND operations shown as ∩, bitwise OR
operations shown as ∪, and one-bit left rotation operations
shown as <<<. The S7 and S9 functions can be evaluated using
look-up tables each containing pre-determined elements.

In some embodiments of the invention the Kasumi
algorithm is implemented in parallel for a plurality of inputs and
keys to generate a plurality of outputs wherein functions of
the algorithm are evaluated in parallel. In some embodiments,
the algorithm is implemented entirely in parallel wherein each
function of the algorithm is implemented in parallel while in
other embodiments the algorithm is implemented partially in
parallel wherein at least one function of at least one of the
rounds 2000 is implemented in parallel. Furthermore, as
discussed above, the invention is not limited to the Kasumi
algorithm and in other embodiments of the invention, other
ciphering algorithms are implemented in parallel.

More generally, in some embodiments of the invention,
a method is used to generate a plurality of outputs according
to a ciphering algorithm which for each of the plurality of
outputs operates on a respective input using a respective key.
The ciphering algorithm has a plurality of rounds in which
functions are evaluated. At least one of the functions of at
least one of the rounds is evaluated in parallel. In
particular, for a plurality of first inputs each being
associated with one of the respective inputs, and in parallel
with the other first inputs, the method involves generating an
output by looking-up at least one look-up table using the first
input wherein each look-up table has a plurality of elements.
In other words, each look-up table is looked-up in parallel
using the first inputs. Different methods of performing table
look-ups in parallel will be described below. For the Kasumi
algorithm, the parallel table look-ups might be used for any
one or more of the S7 and S9 functions, for example. In some
embodiments of the invention, other functions of the Kasumi
algorithm such as the FO_i and FL_i (i = 1 to 8) functions, FI_i,g
(g = 1 to 3) functions, the exclusive-OR operations shown as ⊕,
zero-extend operations, truncate operations, bitwise AND
operations shown as ∩, bitwise OR operations shown as ∪, and
one-bit left rotation operations shown as <<< are evaluated in
parallel using vector instructions available on SIMD (Single
Instruction Multiple Data) architectures.

A major part of the Kasumi algorithm consists of
evaluating the S7 and S9 functions. The Kasumi algorithm is
adaptable for implementation on a SIMD (Single Instruction
Multiple Data) architecture such as that of a well known
PowerPC processor having an Altivec co-processor, in which
vector instructions are used to operate on vectors and perform
parallel computations on the data; however, the S7 and S9
functions are not well suited for simple implementation on SIMD
architectures. In particular, for a conventional evaluation of
the S7 function of Figure 3 an output Y with bits y_j (j = 0 to
6) is obtained using tables with 2^7 = 128 7-bit elements.
Similarly, for the S9 function an output Y' with bits y'_k (k = 0
to 8) is evaluated using tables with 2^9 = 512 9-bit elements.
For a conventional evaluation of the S9 function, the table
requires 9-bit elements because the input X' and the output Y'
both have 9 bits. For a parallel implementation on a PowerPC
processor having an Altivec co-processor, the look-up tables
for both the S7 and the S9 functions are too large to fit in a
vector that is looked up using a single vector instruction.
For example, for a PowerPC processor having an Altivec
co-processor a vperm (vector permutation) instruction can be used
to look-up tables. For the vperm instruction, a look-up table
can be loaded into one or two vectors each capable of holding
16 1-byte elements; however, the look-up tables for the S7 and
the S9 functions have 128 and 512 elements, respectively.
Therefore, the tables cannot fit in the one or two vectors used
by the vperm instruction. Furthermore, for a PowerPC processor
having an Altivec co-processor, there are 32 vectors each
having 128 bits. As such, a maximum of 32 16-byte elements,
for example, can be loaded into the vectors and therefore the
look-up table for the S9 function cannot be loaded in its entirety
for look-ups.
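The capacity argument above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the constant and function names are assumptions introduced here, the figures are those stated in the text, and storing each 9-bit element in whole bytes is an assumption about a conventional table layout.

```python
# Capacity check for the conventional look-up tables, using figures from the text.
BYTES_PER_VECTOR = 128 // 8      # each vector has 128 bits = 16 one-byte elements
VPERM_TABLE_VECTORS = 2          # a vperm look-up table spans at most two vectors
NUM_VECTORS = 32                 # vectors available on the co-processor

S7_ELEMENTS = 2 ** 7             # 128 elements of 7 bits each
S9_ELEMENTS = 2 ** 9             # 512 elements of 9 bits each

vperm_capacity = VPERM_TABLE_VECTORS * BYTES_PER_VECTOR   # elements per vperm
register_file_bytes = NUM_VECTORS * BYTES_PER_VECTOR      # total vector storage


def bytes_needed(elements, bits_per_element):
    """Bytes required when every element occupies a whole number of bytes."""
    bytes_per_element = (bits_per_element + 7) // 8
    return elements * bytes_per_element


# Neither table fits in the two vectors addressed by a single vperm.
s7_fits_one_vperm = S7_ELEMENTS <= vperm_capacity
s9_fits_one_vperm = S9_ELEMENTS <= vperm_capacity

# A 512-element table of 9-bit values (2 bytes each) exceeds all 32 vectors.
s9_fits_all_vectors = bytes_needed(S9_ELEMENTS, 9) <= register_file_bytes
```

Under these assumptions a single vperm addresses 32 elements while the 32 vectors together hold 512 bytes, which is why the tables must be restructured before vector look-ups can be used.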

In some embodiments of the invention, for the S7 and
S9 functions specialized tables are used to perform parallel
look-ups. The use of the specialized tables allows the S7 and
S9 functions to be evaluated in parallel using a few
instructions and this allows the Kasumi algorithm to be applied
in parallel on, for example, a SIMD (Single Instruction Multiple
Data) architecture to achieve a high performance.

As a broad introduction to methods of performing
look-ups in parallel, a method will now be described and then
as an illustrative example the method will be applied to the S7
function of the Kasumi algorithm. Similarly, another method
will be described and then an illustrative example of the other
method will be applied to the S9 function.

Referring to Figure 5, shown is a flow chart of a
method of performing parallel look-ups using tables, according
to an embodiment of the invention. The method takes as inputs
two or more inputs X_i and outputs two or more outputs Y_j. The
inputs are each defined by a first set of bits and a second set
of one or more bits. A function that maps the inputs X_i onto
the outputs Y_j is represented by two or more tables each having
a plurality of elements for look-up by the first set of bits of
each of the inputs X_i. At step 410, for each input X_i and in
parallel with other inputs X_i one of the elements of each
look-up table is looked up using the first set of bits that
define the input to obtain outputs. It is to be understood
that each table is looked up in parallel using the first set
of bits of each input. For each input X_i, the outputs
collectively form a set of corresponding outputs. At step 420,
for each input X_i and in parallel with the other inputs, a
corresponding output of the set of corresponding outputs is
selected using the second set of one or more bits that define
the input X_i. Again it is to be understood that, at step 420, the
selection is made in parallel with other selections for other
inputs X_i.
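The two steps of Figure 5 can be sketched in ordinary code. The following is a minimal model under stated assumptions, not the patented implementation: the function name `parallel_lookup`, the choice of the low-order bits as the first set, and the toy 4-bit function are all introduced here for illustration, and the Python loops model only the data flow, not the parallel execution.

```python
def parallel_lookup(inputs, tables, first_bits):
    """Model of steps 410 and 420 for a list of integer inputs.

    inputs     -- list of non-negative integers (the inputs X_i)
    tables     -- list of sub-tables; tables[s] holds the outputs for all
                  inputs whose second set of bits equals s (assumed layout)
    first_bits -- number of low-order bits forming the first set (assumed)
    """
    mask = (1 << first_bits) - 1

    # Step 410: look up every table with the first set of bits of each input.
    # Each inner list is the set of corresponding outputs for one input.
    candidates = [[table[x & mask] for table in tables] for x in inputs]

    # Step 420: the second set of bits selects one corresponding output.
    return [cands[x >> first_bits] for x, cands in zip(inputs, candidates)]


# Toy 4-bit function: 2-bit first set, 2-bit second set, four 4-element tables.
full_table = [(7 * x + 3) % 16 for x in range(16)]      # hypothetical values
tables = [full_table[s * 4:(s + 1) * 4] for s in range(4)]
outputs = parallel_lookup([0, 5, 10, 15], tables, first_bits=2)
```

Each input is looked up in every table even though only one result is kept. That redundancy is what lets every lane execute the same instruction sequence on a SIMD machine.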

As an illustrative example, the method of Figure 5
will now be applied for evaluating the S7 function of the
Kasumi algorithm with reference being made to Figures 3 and 6
to 10. It is to be clearly understood that what follows is only
one example implementation falling in the broad language of
Figure 5.

As shown by Equations 200 to 206 in Figure 3, the S7
function has X as an input and has Y as an output with X and Y
being defined by 7 bits x_i and y_j, respectively. As such, in
applying the method of Figure 5 to evaluate the S7 function,
X_i = X and Y_j = Y. Since the input X has 7 bits x_i, there are
2^7 = 128 possible values for Y in evaluating the S7 function. In
the illustrated embodiment of the invention, each possible
value for Y is pre-determined and stored in a memory as one of
2^7 = 128 elements. The 128 elements form look-up tables and for
each input X, the elements from the look-up tables are
looked-up and then one of the elements is selected.
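Pre-determining every output of a 7-bit function can be illustrated as follows. This is a sketch only: `toy_function` is a hypothetical stand-in written in the same AND/exclusive-OR style as the equations of Figure 3, and it is not the Kasumi S7 function, whose equations are not reproduced here.

```python
def bit(x, i):
    """Return bit i of integer x (bit 0 is the least significant)."""
    return (x >> i) & 1


def toy_function(x):
    """Hypothetical 7-bit function built from AND/XOR bit equations.

    These are NOT the S7 equations of Figure 3; they only mimic their form,
    where a product of bits means AND and the terms are joined by XOR.
    """
    y = 0
    for j in range(7):
        a = bit(x, j)
        b = bit(x, (j + 1) % 7)
        c = bit(x, (j + 3) % 7)
        y_j = a ^ (b & c) ^ (a & b & c)   # one output bit per equation
        y |= y_j << j
    return y


# Pre-compute all 2^7 = 128 outputs once; evaluation is then a table look-up.
TABLE = [toy_function(x) for x in range(128)]
```

Once the table exists, evaluating the function for an input reduces to indexing, which is the operation the vector look-ups described in this section perform for many inputs at once.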

In Figure 6, shown is a flow diagram of elements
being looked up in look-up tables and selected according to the
method of Figure 5 as applied to the S7 function. In
particular, the flow diagram of Figure 6 is used to illustrate
the method steps 410, 420 of Figure 5 for a specific input X.

In Figure 6, the 128 elements are shown as elements
520 (only 20 elements 520 are shown for clarity). Each element
520 has a pre-determined value 530 shown as S7(x6 x5 x4 x3 x2 x1 x0)
which is a function of a bit sequence 575 x6 x5 x4 x3 x2 x1 x0
as given by the S7 function of Figure 3. In the method of
Figure 5, for each input X, one of the elements 520 is
selected depending on a value the input X is carrying. As
such, for purposes of illustrating how one of the elements 520
is selected, for each input X the values for the bit sequences
575 are explicitly shown as numbers rather than having the
pre-determined values 530 being shown explicitly.

In the illustrative example, the method of Figure 5
is implemented on a PowerPC processor having an Altivec
co-processor. A respective vperm (vector permutation) instruction
is used at step 410 for performing look-ups in each look-up
table and vsel (vector select) instructions are used at step
420 to select a corresponding output for each input X.
Further details of this particular embodiment will be
described both generally and with reference to a specific input
value for X = x6 x5 x4 x3 x2 x1 x0 = 1001010 in base-2 notation, which
corresponds to X = 74 in base-10 notation.
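The selection at step 420 can be modelled with a bitwise select. The sketch below rests on assumptions: `vsel` here is a plain Python emulation of a bitwise vector select (a mask bit of 0 takes the first operand, a mask bit of 1 takes the second), and building an all-ones or all-zeros byte mask from one input bit is one possible approach chosen for illustration, not necessarily the mask construction used in the embodiment. The candidate values are hypothetical.

```python
def vsel(a, b, mask):
    """Bitwise select on lists of bytes: mask bit 0 takes a, mask bit 1 takes b."""
    return [(x & ~m & 0xFF) | (y & m) for x, y, m in zip(a, b, mask)]


def select_by_bit(a, b, inputs, bit_index):
    """Take b[i] where the given bit of inputs[i] is set, otherwise a[i]."""
    mask = [0xFF if (x >> bit_index) & 1 else 0x00 for x in inputs]
    return vsel(a, b, mask)


# Four hypothetical groups of candidate outputs, one per value of bits x6 x5.
inputs = [0b0001010, 0b0101010, 0b1001010, 0b1101010]
g00, g01, g10, g11 = [10] * 4, [20] * 4, [30] * 4, [40] * 4

low = select_by_bit(g00, g01, inputs, 5)      # choose on x5 where x6 = 0
high = select_by_bit(g10, g11, inputs, 5)     # choose on x5 where x6 = 1
result = select_by_bit(low, high, inputs, 6)  # final choice on x6
```

For the input 1001010 (third lane) the two most significant bits are 10, so the value from the third group is kept.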

A single vperm instruction, as described in detail
below, can be used to operate on input vectors
vA(e_1,a, ..., e_16,a), vB(e_1,b, ..., e_16,b) using a vector
vC(e_1,c, ..., e_16,c) with each of these vectors having 2^4 = 16
1-byte elements e_w,a, e_w,b and e_w,c (w = 1 to 16), respectively.
The vperm instruction returns a vector vD(e_1,d, ..., e_16,d) having
16 1-byte elements e_w,d. In particular, for each element e_w,d
of the vector vD(e_1,d, ..., e_16,d) one of the elements e_w,a of the
vector vA(e_1,a, ..., e_16,a) and the elements e_w,b of the vector
vB(e_1,b, ..., e_16,b) is selected using 5 bits of a respective one of
the 1-byte elements of the vector vC(e_1,c, ..., e_16,c).
Alternatively, in other embodiments of the invention, a single
vperm instruction can be used to operate on the vector
vA(e_1,a, ..., e_16,a) using vector vC(e_1,c, ..., e_16,c) and return the
vector vD(e_1,d, ..., e_16,d), wherein for each element e_w,d of the
vector vD(e_1,d, ..., e_16,d) one of the elements e_w,a of the vector
vA(e_1,a, ..., e_16,a) is selected using 4 bits of a respective one of
the 1-byte elements e_w,c of the vector vC(e_1,c, ..., e_16,c).
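The behaviour of the permute described above can be emulated in a few lines. This is a sketch for illustration rather than a definitive model of the hardware: the function is an ordinary Python emulation, elements are numbered from 0 rather than from 1 as in the text, and the table values are hypothetical.

```python
def vperm(vA, vB, vC):
    """Emulate a byte-wise vector permute on 16-element lists.

    For each control byte in vC, the 5 low-order bits index into the
    32-element concatenation of vA and vB; the other bits are ignored.
    """
    assert len(vA) == len(vB) == len(vC) == 16
    table = vA + vB                        # 32 one-byte elements
    return [table[c & 0x1F] for c in vC]   # one look-up per lane


vA = list(range(100, 116))   # hypothetical table entries 0..15
vB = list(range(200, 216))   # hypothetical table entries 16..31
vC = [0, 15, 16, 31, 0b1001010] + [0] * 11   # 0b1001010 & 0x1F == 0b01010
vD = vperm(vA, vB, vC)
```

Because only 5 bits of each control byte are used, a 7-bit input can be passed as the control byte directly: its 5 least significant bits perform the look-up and its 2 most significant bits are left for the later selection.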

In the illustrative example, the vperm instruction is
used to operate on vectors vA(e_1,a, ..., e_16,a), vB(e_1,b, ..., e_16,b)
using vector vC(e_1,c, ..., e_16,c) each having the 16 1-byte elements
e_w,a, e_w,b and e_w,c, respectively. In particular, the vperm
instruction operates on 16 elements of a 32-element look-up
table that is loaded as vector vA(e_1,a, ..., e_16,a) and another 16
elements of the 32-element look-up table that is loaded as
vector vB(e_1,b, ..., e_16,b) with the 16 inputs X being loaded as
vector vC(e_1,c, ..., e_16,c).

Recall with reference to Figure 5, that each input X
has a first set of bits and a second set of bits. There is a
respective look-up table for each permutation of the second set
of bits. In other words, all elements of a given look-up table
will contain Y values determined for a set of X values sharing
a common second set of bits.

For the example of Figure 6, each X input is 7 bits,
and has a 5-bit first set of bits and a 2-bit second set of
bits. The first set consists of the least significant bits
while the second set consists of the most significant bits.
There is a respective look-up table for each permutation of the
second set of bits, in this case requiring four look-up tables
540 each containing 2^5 = 32 elements. Each look-up table 540
has portions 550, 560 each having 16 elements 520 to be
operated on by the vperm instruction as vectors
vA(e_{1,a}, ..., e_{16,a}) and vB(e_{1,b}, ..., e_{16,b}), respectively.

A step 581 in the flow diagram of Figure 6 is
illustrative of step 410 of Figure 5 wherein for each input X,
one element 520 is looked-up for each look-up table 540 to
obtain outputs. Outputs from the look-up tables 540 from step
410 are shown as groups of outputs 591, 592, 593, 594 with each
group of outputs 591, 592, 593, 594 having 16 outputs (only one
output in each group of outputs 591, 592, 593, 594 is shown for
clarity). Outputs from the groups of outputs 591, 592, 593,
594 form sets of corresponding outputs. For example, outputs
506 from the groups of outputs 591, 592, 593, 594 form a set of
corresponding outputs. Each output of the groups of outputs
591, 592, 593, 594 has a pre-determined value
S7(x_6x_5x_4x_3x_2x_1x_0) which is a function of a bit sequence 514
and for each set of corresponding outputs the bit sequences 514
have the same 5 least significant bits but different 2 most
significant bits. For the example input with
X = x_6x_5x_4x_3x_2x_1x_0 = 1001010, the bit sequences 514 of
corresponding outputs 506 all have the same 5 least significant
bits 01010 but different 2 most significant bits 00, 01, 10, 11.
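
By way of illustration only, the division of a 128-element table into
four 32-element look-up tables, and the four corresponding outputs
obtained for one input, might be sketched as follows. The table
contents are a hypothetical stand-in and are not the Kasumi S7 values;
the names SUPER_TABLE, split_super_table and corresponding_outputs are
assumptions introduced for this sketch.

```python
# Stand-in 128-entry table: NOT the real Kasumi S7 values, only a
# hypothetical permutation of 0..127 used to make the sketch runnable.
SUPER_TABLE = [(37 * x + 11) % 128 for x in range(128)]


def split_super_table(table):
    """Split a 128-entry table into four 32-entry look-up tables, one per
    value of the two most significant input bits (x_6 x_5), and split each
    of those into two 16-entry portions (loaded as vA and vB)."""
    tables = []
    for msb in range(4):
        sub = table[32 * msb: 32 * msb + 32]  # indexed by x_4..x_0
        tables.append((sub[:16], sub[16:]))   # the two 16-element portions
    return tables


def corresponding_outputs(tables, x):
    """Look up every table with the 5 least significant bits of x,
    giving one candidate output per table."""
    low5 = x & 0x1F
    return [(portion_a + portion_b)[low5] for portion_a, portion_b in tables]


TABLES = split_super_table(SUPER_TABLE)
X = 0b1001010                      # example input from the description
CANDIDATES = corresponding_outputs(TABLES, X)
```

The candidate at position X >> 5 (here position 2, for most significant
bits 10) is the value of the table for the example input; the remaining
three candidates are discarded by the later selection.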

Step 420 of Figure 5, in which for each input X, a
corresponding output of the set of corresponding outputs is
selected, is shown as a two step process in the flow diagram of
Figure 6. In a first selection 582, a group of outputs 596 is
selected from groups of outputs 591, 592 and a group of outputs
598 is selected from groups of outputs 593, 594. The groups of
outputs 596, 598 each have 16 outputs (only one output 508 is
shown in each group of outputs 596, 598 for clarity). Outputs
from the groups of outputs 596, 598 form sets of corresponding
outputs. For example, outputs 508 from the groups of outputs
596, 598 form a set of corresponding outputs. Each output of
the groups of outputs 596, 598 has a pre-determined value
S7(x_6x_5x_4x_3x_2x_1x_0) which is a function of a bit sequence 516
and for each set of corresponding outputs the bit sequences 516
have the same 6 least significant bits but a different most
significant bit. For the example input with
X = x_6x_5x_4x_3x_2x_1x_0 = 1001010, the bit sequences 516 of
corresponding outputs 508 both have the same 6 least significant
bits 001010 but different most significant bits 0, 1. In a second
selection 583, a group of outputs 599 is selected from the groups
of outputs 596, 598 with the group of outputs 599 having 16
outputs (only one output 511 is shown in the group of outputs
599 for clarity). Each output of the group of outputs 599 has
a pre-determined value S7(x_6x_5x_4x_3x_2x_1x_0) which is a function
of a bit sequence 517 that corresponds to a respective input X.
For example, the bit sequence 517 of output 511 has a value that
corresponds to the example input X = x_6x_5x_4x_3x_2x_1x_0 = 1001010.

In the illustrative example, each of the 16 inputs X
has 7 bits x_i of which there is the first set of bits having 5
least significant bits x_4x_3x_2x_1x_0 and the second set of bits
having 2 most significant bits x_6x_5. For our specific example,
the input has a value X = x_6x_5x_4x_3x_2x_1x_0 = 1001010 in base-2
notation with the order of significance from most significant
to least significant being from left to right. The first set
of bits for the input corresponds to the 5 least significant bits
01010 of X = x_6x_5x_4x_3x_2x_1x_0 = 1001010 and the second set of
bits for the input corresponds to the 2 most significant bits 10
of X = x_6x_5x_4x_3x_2x_1x_0 = 1001010.

At step 410 of Figure 5, for each look-up table 540
the vperm instruction is used to perform a look-up in the
look-up table 540 using the first set of bits of each of 16 inputs
X. Thus four vperm instructions are used to look-up the four
look-up tables 540.

The vperm instruction will now be described with
reference to Figures 6 and 7. In Figure 6, the look-up tables
540 are shown each having portion 550 and portion 560. At step
410, for each look-up table 540 a vperm instruction operates on
vectors vA(e_{1,a}, ..., e_{16,a}) 610 and vB(e_{1,b}, ..., e_{16,b}) 620
using vector vC(e_{1,c}, ..., e_{16,c}) 630 to return a vector
vD(e_{1,d}, ..., e_{16,d}) 640. The vectors vA(e_{1,a}, ..., e_{16,a}) 610
and vB(e_{1,b}, ..., e_{16,b}) 620 contain elements 520 from the
portions 550 and 560, respectively, of the look-up table 540 being
looked-up, and the vector vC(e_{1,c}, ..., e_{16,c}) 630 contains the
16 inputs X. The vector vA(e_{1,a}, ..., e_{16,a}) 610 has 16 1-byte
elements e_{w,a} 615 each addressable using an index from 0 to F in
base-16 notation, or equivalently from 00000 to 01111 in base-2
notation. The base-16 notation is used for purposes of clarity in
Figure 7 to prevent cluttering. Each element e_{w,a} 615 contains
one of elements 520 from portion 550 of the look-up table 540 being
looked up. Similarly, the vector vB(e_{1,b}, ..., e_{16,b}) 620 has 16
1-byte elements e_{w,b} 625 each addressable using an index from 10
to 1F in base-16 notation, or equivalently from 10000 to 11111
in base-2 notation. Each element e_{w,b} 625 contains one of
elements 520 from portion 560 of the look-up table 540 being
looked up.

For the vector vC(e_{1,c}, ..., e_{16,c}) 630, the 16 inputs
X = x_6x_5x_4x_3x_2x_1x_0 are shown as elements e_{w,c} 635 and the 5
least significant bits x_4, x_3, x_2, x_1, x_0, which form the first
set of bits, of each of the 16 inputs X = x_6x_5x_4x_3x_2x_1x_0 are
used as indexes for fetching either an element e_{w,a} 615 of vector
vA(e_{1,a}, ..., e_{16,a}) 610 or an element e_{w,b} 625 of vector
vB(e_{1,b}, ..., e_{16,b}) 620 resulting in the vector
vD(e_{1,d}, ..., e_{16,d}) 640. Example values in base-16 notation for
the 5 least significant bits x_4, x_3, x_2, x_1, x_0 of each of the 16
inputs X = x_6x_5x_4x_3x_2x_1x_0 are shown as A, 7, 0, 15, 5, 9, 13, 15,
2, 16, 19, 1A, A, 1F, C, 1B in elements e_{w,c} 635 of vector
vC(e_{1,c}, ..., e_{16,c}) 630. For our specific example input,
X = x_6x_5x_4x_3x_2x_1x_0 = 1001010 has 01010 as its 5 least significant
bits, the 5 least significant bits 01010 corresponding to A in
base-16 notation as shown within one of the elements e_{w,c} 635 of
vector vC(e_{1,c}, ..., e_{16,c}) 630. During the vperm instruction, the
5 least significant bits of each input X represented as A, 7,
0, 15, 5, 9, 13, 15, 2, 16, 19, 1A, A, 1F, C, 1B in base-16
notation in elements e_{w,c} 635 of vector vC(e_{1,c}, ..., e_{16,c}) 630
are used to fetch a respective one of either an element e_{w,a} 615
of vector vA(e_{1,a}, ..., e_{16,a}) 610 or an element e_{w,b} 625 of
vector vB(e_{1,b}, ..., e_{16,b}) 620 resulting in the vector
vD(e_{1,d}, ..., e_{16,d}) 640. Each element fetched is output as
one of the elements e_{w,d} 645 of vector vD(e_{1,d}, ..., e_{16,d}) 640.
For each vperm instruction, the vector vD(e_{1,d}, ..., e_{16,d}) 640
results in one of the groups of outputs 591, 592, 593, 594 shown in
Figure 6.
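
As a non-limiting sketch, the behaviour of the vperm instruction
described above can be emulated in software as shown below. The
function name vperm and the sample vector contents are assumptions made
for illustration; only the indexing rule (the 5 least significant bits
of each control byte select one of the 32 bytes of vA followed by vB)
is taken from the description.

```python
def vperm(vA, vB, vC):
    """Emulate a byte-wise vector permute: for each of the 16 control
    bytes in vC, the 5 least significant bits select one byte from the
    32-byte concatenation of vA (indexes 0x00-0x0F) and vB (0x10-0x1F)."""
    assert len(vA) == len(vB) == len(vC) == 16
    both = list(vA) + list(vB)
    return [both[c & 0x1F] for c in vC]


# Hypothetical table portions, chosen so the source of each byte is obvious.
vA = list(range(100, 116))   # stands for the first 16-element portion
vB = list(range(200, 216))   # stands for the second 16-element portion

# First control byte is the example input 1001010; the rest carry the
# example index values from the description in their 5 low bits.
vC = [0b1001010, 0x07, 0x00, 0x15, 0x05, 0x09, 0x13, 0x15,
      0x02, 0x16, 0x19, 0x1A, 0x0A, 0x1F, 0x0C, 0x1B]

vD = vperm(vA, vB, vC)
```

Here the first element of vD is taken from index A of vA, because the
two most significant bits of the example input are ignored by the
permute, and the element for index 15 is taken from vB.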

As discussed above, the outputs from the groups of
outputs 591, 592, 593, 594 collectively form sets of
corresponding outputs and for each input X the bit sequences
514 have common 5 least significant bits but different 2 most
significant bits. For example, referring back to Figure 6, for
the specific example input with X = x_6x_5x_4x_3x_2x_1x_0 = 1001010,
the look-ups in look-up tables 540 using the 5 least significant
bits 01010 as indexes in the vperm instructions result in the
outputs 506 having pre-determined values S7(x_6x_5x_4x_3x_2x_1x_0)
which are functions of the bit sequences 514 having common 5 least
significant bits 01010 but different 2 most significant bits.
In particular, one of the pre-determined values
S7(x_6x_5x_4x_3x_2x_1x_0) of the set of corresponding outputs 506 is a
function of the example input X = x_6x_5x_4x_3x_2x_1x_0 = 1001010.

In this specific illustrative example, at step 410,
there is a total of 4 vperm instructions, and for each input X
the number of possible outputs from the 128 elements 520 has
been narrowed from 128 possible outputs down to 4 possible
outputs.

With the outputs from the groups of outputs 591, 592,
593, 594 collectively forming sets of corresponding outputs, at
step 420 one corresponding output from each set of
corresponding outputs is selected. For our specific example,
one of the four corresponding outputs 506 is selected. The
selection is made using the second set of bits x_6, x_5 that
define the specific example input with
X = x_6x_5x_4x_3x_2x_1x_0 = 1001010. In particular, the specific
example input with X = x_6x_5x_4x_3x_2x_1x_0 = 1001010 has 10 as its
second set of bits. As described in detail below with reference to
Figures 6 and 8, the selection is performed by successively
performing a selection on a remaining number of corresponding
outputs for each set of corresponding outputs, wherein each time
the selection is made the number of remaining corresponding outputs
is halved. This selection will now be described for the
illustrative example with reference to Figure 8.

Referring to Figure 8, shown is a flow chart of a
method of performing step 420 of the method of Figure 5. In
Figure 8 for each input X, two outputs are selected from the
four outputs obtained using a bit from the second set of bits
that define the input (step 710). After step 710, there are
two outputs for each input X = x_6x_5x_4x_3x_2x_1x_0 and one of the
outputs is selected using another bit from the second set of
bits that define the input (step 720). Referring back to
Figure 6, step 710 is illustrated by the first selection 582 in
which for each set of corresponding outputs one half of the
corresponding outputs are selected. For example, for the
specific input with X = x_6x_5x_4x_3x_2x_1x_0 = 1001010, of the four
corresponding outputs 506 two outputs 508 are selected. Step
720 is illustrated by the second selection 583 in which for
each set of remaining outputs one half of the remaining outputs
is selected. For example, for the specific example input with
X = x_6x_5x_4x_3x_2x_1x_0 = 1001010, of the remaining outputs 508, one
output 511 is selected.

In the illustrative example, as discussed above the
selection of outputs at steps 710 and 720 is performed using an
Altivec vsel instruction. The vsel instruction will now be
described in detail with reference to Figures 9 and 10.

Referring to Figure 9, shown is a flow chart of a
method of selecting an output from two other outputs in the
method steps 710, 720 of Figure 8. For each input X, one of
the bits of the second set of bits that define the input X is
replicated as a 1-byte element (step 810) and then the vsel
instruction is applied using the replicated bit of each input X
(step 820). The method of Figure 9 will now be applied to
obtain the outputs 596 of Figure 6. To obtain the group of
outputs 596, at step 810 for each input X = x_6x_5x_4x_3x_2x_1x_0, the
least significant bit x_5 of the second set of bits x_6, x_5 that
define the input is replicated. For example, the second set of
bits x_6, x_5 of the specific example input with
X = x_6x_5x_4x_3x_2x_1x_0 = 1001010 corresponds to 10, which has 0 as a
least significant bit. As such, the bit 0 is replicated as a 1-byte
element represented as 00000000. At step 820 the vsel instruction
operates on the groups of outputs 591 and 592 as vector
elements using the replicated bits of each input
X = x_6x_5x_4x_3x_2x_1x_0.
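
The replication of step 810 might be sketched as follows. The helper
name replicate_bit is an assumption; the description states only that
one bit of each input is replicated across a 1-byte element.

```python
def replicate_bit(inputs, bit):
    """For each input, copy the chosen bit into all 8 bit positions of a
    1-byte element: 0x00 if the bit is 0, 0xFF if the bit is 1."""
    return [0xFF if (x >> bit) & 1 else 0x00 for x in inputs]


X = 0b1001010                     # example input, second set of bits = 10
MASK_X5 = replicate_bit([X], 5)   # bit x_5 = 0
MASK_X6 = replicate_bit([X], 6)   # bit x_6 = 1
```

For the example input, the element built from x_5 is 00000000 as stated
above, while the element built from x_6 is 11111111.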

In particular, in Figure 10 the vsel instruction
operates on vectors vA_2(f_{1,a}, ..., f_{16,a}) 910 and
vB_2(f_{1,b}, ..., f_{16,b}) 920 using vector
vC_2(f_{1,c}, ..., f_{16,c}) 930. The vectors
vA_2(f_{1,a}, ..., f_{16,a}) 910, vB_2(f_{1,b}, ..., f_{16,b}) 920, and
vC_2(f_{1,c}, ..., f_{16,c}) 930 have 128 1-bit elements f_{t,a} 915,
f_{t,b} 925, and f_{t,c} 935 (t = 1 to 128), respectively (only 8
elements f_{t,a} 915, only 8 elements f_{t,b} 925, and only 8 elements
f_{t,c} 935 are shown for clarity). The vector
vC_2(f_{1,c}, ..., f_{16,c}) 930 operates on vectors
vA_2(f_{1,a}, ..., f_{16,a}) 910 and vB_2(f_{1,b}, ..., f_{16,b}) 920
resulting in a vector vD_2(f_{1,d}, ..., f_{16,d}) 940 having
elements f_{t,d} 945 (only 8 elements f_{t,d} 945 are shown for
clarity). In particular, for each element f_{t,c} 935 of the
vector vC_2(f_{1,c}, ..., f_{16,c}) 930, if the element f_{t,c} 935
contains a "0" a corresponding element f_{t,a} 915 from the vector
vA_2(f_{1,a}, ..., f_{16,a}) 910 is selected as an element for the
vector vD_2(f_{1,d}, ..., f_{16,d}) 940 and if the element f_{t,c} 935
contains a "1" a corresponding element f_{t,b} 925 from the vector
vB_2(f_{1,b}, ..., f_{16,b}) 920 is selected as an element for the
vector vD_2(f_{1,d}, ..., f_{16,d}) 940.

To obtain the group of outputs 596, a vsel
instruction operates on the outputs 591, 592 as vectors
vA_2(f_{1,a}, ..., f_{16,a}) 910 and vB_2(f_{1,b}, ..., f_{16,b}) 920,
respectively, using the replicated bits of each input X as elements
f_{t,c} 935 of the vector vC_2(f_{1,c}, ..., f_{16,c}) 930. In Figure
10, the 8 elements f_{t,a} 915 shown as 00111111 represent the
pre-determined value of the output 506 which is a function of the bit
sequence 514 with 0001010 in base-2 notation. In particular, for an
input corresponding to 0001010 in base-2 notation the S7 function
outputs a value of 63 in base-10 notation, which corresponds to
00111111 in base-2 notation. Similarly, the 8 elements f_{t,b}
925 shown as 00101000 represent the pre-determined value of
output 506 which is a function of the bit sequence 514 with
0101010 in base-2 notation. In particular, for an input
corresponding to 0101010 in base-2 notation the S7 function
outputs a value of 40 in base-10 notation, which corresponds to
00101000 in base-2 notation. The 8 elements f_{t,c} 935 shown each
containing "0" correspond to the replicated bit x_5 = 0 from the
specific example input X = x_6x_5x_4x_3x_2x_1x_0 = 1001010. The 8
elements f_{t,c} 935 are used to select the 8 elements f_{t,a} 915 as
elements f_{t,d} 945 of the vector vD_2(f_{1,d}, ..., f_{16,d}) 940.
The 8 elements f_{t,d} 945 shown correspond to the output 508 having
associated with it the bit sequence 516 corresponding to
0001010.

The vsel instruction is also used at step 710 to
obtain the group of outputs 598; however, in this case the vsel
instruction operates on groups of outputs 593, 594 as vectors
vA_2(f_{1,a}, ..., f_{16,a}) 910 and vB_2(f_{1,b}, ..., f_{16,b}) 920,
respectively. Finally, the vsel instruction is used to obtain the
group of outputs 599 at step 720 by operating on the groups of
outputs 596, 598 as vectors vA_2(f_{1,a}, ..., f_{16,a}) 910 and
vB_2(f_{1,b}, ..., f_{16,b}) 920, respectively, using replications of
the most significant bit x_6 of the second set of bits x_6, x_5 of
each input X = x_6x_5x_4x_3x_2x_1x_0 as vector
vC_2(f_{1,c}, ..., f_{16,c}).
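
For illustration only, steps 410 and 420 can be emulated end to end as
sketched below: four permutes followed by three bit-wise selects, for
16 inputs at once. The function names and the table contents are
assumptions; the table is a hypothetical permutation of 0..127 and is
not the Kasumi S7 table.

```python
def vperm(vA, vB, vC):
    """Byte permute: low 5 bits of each control byte index into vA + vB."""
    both = list(vA) + list(vB)
    return [both[c & 0x1F] for c in vC]


def vsel(vA, vB, vC):
    """Bit-wise select: each result bit comes from vA where the mask bit
    in vC is 0 and from vB where the mask bit is 1."""
    return [(a & ~c & 0xFF) | (b & c) for a, b, c in zip(vA, vB, vC)]


def replicate_bit(inputs, bit):
    """Replicate one bit of each input across a whole byte."""
    return [0xFF if (x >> bit) & 1 else 0x00 for x in inputs]


def parallel_lookup(table, inputs):
    """Look up 16 seven-bit inputs in a 128-entry table."""
    assert len(table) == 128 and len(inputs) == 16
    # Step 410: one permute per 32-element look-up table.
    groups = []
    for msb in range(4):
        sub = table[32 * msb: 32 * msb + 32]
        groups.append(vperm(sub[:16], sub[16:], inputs))
    # Step 710: first selection, driven by bit x_5.
    m5 = replicate_bit(inputs, 5)
    first = vsel(groups[0], groups[1], m5)
    second = vsel(groups[2], groups[3], m5)
    # Step 720: second selection, driven by bit x_6.
    m6 = replicate_bit(inputs, 6)
    return vsel(first, second, m6)


TABLE = [(37 * x + 11) % 128 for x in range(128)]   # stand-in values
INPUTS = [0b1001010] + [(9 * i + 3) % 128 for i in range(1, 16)]
OUT = parallel_lookup(TABLE, INPUTS)
```

The result agrees element by element with a direct look-up of each
input in the 128-entry table, which is the property the method relies
on.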

Referring back to Figure 6, in the illustrative
example the vperm instruction makes use of the 5 least
significant bits x_4, x_3, x_2, x_1, x_0 of each input with
X = x_6x_5x_4x_3x_2x_1x_0 as a first set of bits to look-up the look-up
tables 540. The vsel instruction then makes use of the two
most significant bits x_6, x_5 of each input with
X = x_6x_5x_4x_3x_2x_1x_0 as a second set of bits to select outputs from
the vperm instructions. Alternatively, in some embodiments of the
invention the first set of bits of each input has 4 bits x_3, x_2,
x_1, x_0 and the second set of bits of each input has 3 bits x_6,
x_5, x_4. In such embodiments of the invention the vperm
instruction looks up look-up tables of 16 elements using the
first set of bits of each input X resulting in 8 corresponding
outputs for each input X = x_6x_5x_4x_3x_2x_1x_0. A number N_vsel = 7 of
vsel instructions are then used to select one of the
corresponding outputs of each input X = x_6x_5x_4x_3x_2x_1x_0. In the
above examples, the vperm instruction is used to look-up tables
of 32 1-byte elements or tables of 16 1-byte elements; however,
other implementations are possible. For example, in some
implementations the vperm is used to look-up tables of 16
2-byte elements, 4 8-byte elements, or 2 16-byte elements.
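
The instruction counts quoted above follow from a simple relation,
sketched here under the assumption that one vperm is issued per look-up
table and that the selection halves the remaining outputs at each stage.
The function name instruction_counts is hypothetical.

```python
def instruction_counts(n_bits, first_set_bits):
    """Return (number of permutes, number of selects) for inputs of
    n_bits bits when first_set_bits bits index each look-up table.
    Assumes one permute per table and a binary selection tree."""
    second_set_bits = n_bits - first_set_bits
    n_tables = 2 ** second_set_bits
    return n_tables, n_tables - 1


FIVE_BIT_INDEX = instruction_counts(7, 5)   # 2-bit second set
FOUR_BIT_INDEX = instruction_counts(7, 4)   # 3-bit second set
```

With a 5-bit first set this gives 4 permutes and 3 selects, and with a
4-bit first set it gives 8 permutes and the 7 selects mentioned above.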

Furthermore, in the embodiments of Figures 5 to 10,
for each input with X = x_6x_5x_4x_3x_2x_1x_0, the first set of bits
corresponds to least significant bits x_4, x_3, x_2, x_1, x_0 and the
second set of bits corresponds to most significant bits x_6, x_5;
however, the invention is not limited to such embodiments, and
in other embodiments of the invention when using the vperm
instruction for each input with X = x_6x_5x_4x_3x_2x_1x_0, any 4 or 5
bits of the bits x_i are used for the first set of bits and the
remaining bits x_i are used for the second set of bits. This is
achieved by storing the pre-determined values of the elements
520 in a different order than shown in Figure 6.

In the illustrative example, there are four look-up
tables being looked-up using vperm instructions, the four
look-up tables collectively forming a larger table referred to as a
super table. The number of tables a super table is divided
into depends on the number of elements in the super table. In
particular, in some cases the number of elements is low enough
for the super table to be loaded and then looked-up using a
single vperm instruction. For such cases, the method of Figure
5 can be modified by looking up only one look-up table at step
410 and not performing step 420. As such, in some embodiments
of the invention, there is a method in which for each of a
plurality of inputs and in parallel with the other inputs a
look-up table having a plurality of elements is looked-up using
the input.

The above illustrative example has been described in
the context of the S7 function of the Kasumi algorithm in which
the input X_i = X and the output Y_j = Y with both X and Y each
being defined by N_x = 7 bits and N_y = 7 bits, respectively;
however, the invention is not limited to the S7 function. In
some implementations operations are performed for N_x ≥ 1 and
N_y ≥ 1. Furthermore, in the example implementation N_x = N_y;
however, in other implementations N_x ≠ N_y. The invention is not
limited to the method being applied on an architecture
corresponding to a PowerPC processor having an Altivec
co-processor and is also applicable to other SIMD architectures
capable of implementing computations in parallel. Furthermore,
a maximum for N_x and N_y is imposed only by the instructions
available for performing look-ups, and in embodiments of the
invention the maximum number of bits defining the output Y is
imposed only by the instructions available on the architecture
on which the method is applied.

Another limitation of the architecture corresponding
to a PowerPC processor having an Altivec co-processor is with
the use of the vperm instruction which makes use of only 4 or 5
bits of the inputs X for look-ups. However, in other
embodiments of the invention for an input being defined by N_x
bits, depending on the architecture in which the methods of
Figures 5, 8, and 9 are applied the first set of bits of an
input X has two or more bits and the second set of bits has at
least one bit. Preferably, in order to allow a parallel
implementation, a vector permutation operation is used.
However, other processors will provide other operations, or
custom operations may be defined.

Another method of using look-up tables for parallel
implementations will now be discussed with reference to Figure
11 and then as an illustrative example, the method will be applied
to the S9 function of the Kasumi algorithm.

Referring to Figure 11, shown is a flow chart of
another method of performing parallel look-ups using look-up
tables, according to another embodiment of the invention. The
look-up tables each have a plurality of elements and are used
to obtain outputs Y'_K from inputs X'_L. The method of Figure 11
is described for one of the inputs X'_L only; however, the method
is applied to the inputs X'_L in parallel. Each input X'_L is
defined by a first plurality of bits and at step 1010, for each
look-up table a subset of bits of the first plurality of bits
is selected and the look-up table is looked up using the subset
of bits to obtain an output. Each subset of bits contains
fewer bits than the number of bits that define the input. At
step 1020, the outputs are combined.

As an illustrative example, the method of Figure 11
will now be applied to the S9 function in which X'_L = X' and
Y'_K = Y'. It is to be clearly understood that what follows is
only one example implementation falling in the broad language
of Figure 11. The illustrative example will show how the
method of Figure 11 can be applied to the S9 function in a
parallel implementation. However, before the method of
Figure 11 is applied to the S9 function it is worthwhile
examining the S9 function in more detail.

Referring back to Figure 4, the "AND" and exclusive-OR
operations of Equations 300 to 308 are both commutative and
associative. As such the order of the operations in Equations
300 to 308 can be changed without affecting the result. For
example, Equation 300 written as

y'_0 = x'_0x'_2 ⊕ x'_3 ⊕ x'_2x'_5 ⊕ x'_5x'_6 ⊕ x'_0x'_7 ⊕ x'_1x'_7
       ⊕ x'_2x'_7 ⊕ x'_4x'_8 ⊕ x'_5x'_8 ⊕ x'_7x'_8 ⊕ 1              (1)

may be re-written as

y'_0 = x'_2x'_5 ⊕ x'_3 ⊕ x'_0x'_2 ⊕ x'_0x'_7 ⊕ x'_1x'_7 ⊕ x'_2x'_7
       ⊕ x'_4x'_8 ⊕ x'_5x'_6 ⊕ x'_5x'_8 ⊕ x'_7x'_8 ⊕ 1              (2)

with the order of operation in which the components x'_px'_q
undergo exclusive-OR operation being changed.

With the understanding that Equations 300 to 308 are
independent of the order of operation of the components x'_px'_q,
x'_p, and "1", the components x'_px'_q, x'_p, and "1" of each will now
be grouped into groups for which look-up tables will be
generated for implementation using the method of Figure 11. In
particular, each look-up table will be generated as a partial
evaluation of the S9 function. How the look-up tables are
generated as partial evaluations of the S9 function will now be
described with reference to Figures 12, 13, and 14.

Referring to Figure 12, shown is a table generally
indicated by 1100 listing into groups the components x'_px'_q of
Equations 300 to 308 of Figure 4 that are to undergo an
exclusive-OR operation, in accordance with another embodiment
of the invention. Columns 1150, 1151, 1152, 1153, 1154, 1155,
1156, 1157, 1158 list each component x'_px'_q of Equations 300 to
308 of Figure 4 used for obtaining bits
y'_0, y'_1, y'_2, y'_3, y'_4, y'_5, y'_6, y'_7, y'_8, respectively. In
particular, "AND" operations are listed in short form as x'_px'_q
representing x'_p ∩ x'_q. Also listed in table 1100 are components
corresponding to x'_p and "1". The component x'_p indicates that
x'_p is to undergo an exclusive-OR operation. Similarly, the
component "1" indicates that a bit corresponding to 1 is to
undergo an exclusive-OR operation. The components x'_px'_q, x'_p,
and "1" are also shown organized into groups labeled group 1
1110, group 2 1120, group 3 1130, group 4 1140, group 5 1150,
group 6 1160. Each group 1 1110, group 2 1120, group 3 1130,
group 4 1140, group 5 1150, group 6 1160 has at least one
column 1150, 1151, 1152, 1153, 1154, 1155, 1156, 1157, 1158 in
which there is no component x'_px'_q, x'_p, or "1".

Recall with reference to Figure 11, that for each of
a plurality of look-up tables the look-up table is looked-up
using the respective subset of bits which has fewer bits than
the plurality of bits of the input X'. In order to facilitate
building this look-up functionality (described below), within
each group 1 1110, group 2 1120, group 3 1130, group 4 1140,
group 5 1150, group 6 1160 there are 4 or 5 bits x'_p (out of a
possible 9 input bits) which can be used to generate all the
components x'_px'_q and x'_p within the group. For example, within
group 1 1110, bits x'_2, x'_3, x'_4, x'_5 are shown as part of the
components x'_px'_q. These 4 or 5 bits of each group will be a
respective subset of the 9 bit input which will be used to
perform a look-up in a respective look-up table. In the
example of Figure 12, there are 6 groups thus requiring 6
look-up tables. More specifically, in the illustrative example, for
each group 1 1110, group 2 1120, group 3 1130, group 4 1140,
group 5 1150, group 6 1160 a respective look-up table is to be
looked-up using a subset of 4 or 5 bits. For each look-up
table, each bit will contribute to a respective one of 8 of the 9
output bits y'_0 to y'_8. Only 8 of the 9 output bits are generated
because each group 1 to 6 has at least one column in which there
is no component.

In a preferred embodiment of the invention, the
illustrative example, look-ups in look-up tables are made using
the previously described vperm instruction. The vperm
instruction will make use of 4 or 5 bits of the 9 bits x'_p of the
input X' as indexes into vectors, and returns a 1-byte output.
Furthermore, the vperm instruction will be used to perform
look-ups in look-up tables in parallel for 16 inputs X'. In
particular, in some cases the vperm instruction will operate
on one vector having 16 1-byte elements using 4 bits of the 9
bits x'_p of the 16 inputs X' as indexes into the vector, and in
other cases the vperm instruction will operate on two vectors
each having 16 1-byte elements using 5 bits of the 9 bits x'_p of
the 16 inputs X' as indexes into the two vectors. Finally, at
step 1020, for each input X', the outputs obtained are
combined to obtain the bits of the output Y'.
In Figure 13, the subsets of bits selected from the
bits x'_p to be used to look-up the look-up tables of each of
groups 1 to 6 are identified by check marks in a set of columns
1230 of a table generally indicated by 1200. A number of bits
x'_p to be used to look-up the look-up table of each group 1 to 6
is listed in a column 1240. Recall that the vperm instruction
outputs a 1-byte output and therefore, in the illustrative
example, each output to be combined will have fewer bits than
the 9 bits of the output Y'. The output bits for which outputs to
be combined are determined are shown in Figure 13 listed in a set
of columns 1210 for each of the groups 1 to 6. The check marks
identify the output bits which are dependent on the subset of bits
identified in the set of columns 1230; the Xs identify the bits
for which an output bit of an output to be combined is given
a value of zero; and the blank spaces indicate that there is no
output bit being generated. For example, for group 1 there are
outputs for the bits y'_0, y'_1, y'_3, y'_6, y'_8; however, for
group 1 outputs for the bits y'_2, y'_4, y'_5 are not dependent on the
bits x'_p and are set to zero. Furthermore, for group 1, there
is no output bit obtained for the bit y'_7. The number of bits
being generated that depend on the bits x'_p is shown in a column
1220 of table 1200 for each of groups 1 to 6.

Referring back to Figure 12, each group 1 to 6
defines a set of Equations used to generate a look-up table.
How look-up tables are generated will now be
described for group 1. In the illustrative example, for any
group u (u = 1 to 6) the output bits of the set of columns 1210
are expressed as y'_{v,u} (v = 0 to 8). For group 1 an output to be
combined is expressed as a partial output of 8 bits
y'_{0,1}, y'_{1,1}, y'_{2,1}, y'_{3,1}, y'_{4,1}, y'_{5,1}, y'_{6,1},
y'_{8,1} for the bits y'_0, y'_1, y'_2, y'_3, y'_4, y'_5, y'_6, y'_8,
respectively. The bits y'_{0,1}, y'_{1,1}, y'_{3,1}, y'_{6,1}, y'_{8,1}
are obtained from the components x'_px'_q from group 1 and are given by

y'_{0,1} = x'_2x'_5 ⊕ x'_3
y'_{1,1} = [illegible]
y'_{2,1} = 0
y'_{3,1} = [illegible]                                              (3)
y'_{4,1} = 0
y'_{5,1} = 0
y'_{6,1} = [illegible]
y'_{8,1} = [illegible]

Equation (3) defines a set of Equations for
generating a look-up table for group 1. In particular, in the
illustrative example, the look-up table being generated has 2^4 =
16 1-byte elements for the 2^4 = 16 possible combinations of
values for the bits x'_2, x'_3, x'_4, x'_5. Similarly, look-up tables
are generated for groups 2 to 6.

Given the look-up tables for groups 1 to 6, how outputs
from the look-up tables can be obtained and then combined will now
be briefly described for bit y'_0. As indicated
in the set of columns 1210 of table 1200, non-zero output bits
for bit y'_0 are obtained from the look-up tables of groups 1, 3,
and 6 and are expressed as y'_{0,1}, y'_{0,3}, y'_{0,6}, respectively.
The non-zero output bits y'_{0,1}, y'_{0,3}, y'_{0,6} are given by

y'_{0,1} = x'_2x'_5 ⊕ x'_3
y'_{0,3} = x'_0x'_2 ⊕ x'_0x'_7 ⊕ x'_1x'_7 ⊕ x'_2x'_7                 (4)
y'_{0,6} = x'_4x'_8 ⊕ x'_5x'_6 ⊕ x'_5x'_8 ⊕ x'_7x'_8 ⊕ 1

Combining the non-zero output bits y'_{0,1}, y'_{0,3}, y'_{0,6} using
exclusive-OR operations results in

y'_0 = y'_{0,1} ⊕ y'_{0,3} ⊕ y'_{0,6}
     = x'_2x'_5 ⊕ x'_3 ⊕ x'_0x'_2 ⊕ x'_0x'_7 ⊕ x'_1x'_7 ⊕ x'_2x'_7
       ⊕ x'_4x'_8 ⊕ x'_5x'_6 ⊕ x'_5x'_8 ⊕ x'_7x'_8 ⊕ 1              (5)

Equation (5) is equivalent to Equation 300 of Figure 4 and
illustrates how bits can be looked-up using a plurality of
look-up tables and then combined.
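
The combination of partial look-ups for bit y'_0 might be checked with
the sketch below. It is offered under stated assumptions: the terms of
y'_0 are those read from Equations (1) and (2); the bit subset for
group 1 is the one named in the description, while the subsets used for
groups 3 and 6 are inferred from the terms assigned to them; and all
function and variable names are hypothetical.

```python
def bit(x, p):
    """Return bit p of the 9-bit input x."""
    return (x >> p) & 1


def y0_full(x):
    """Bit y'_0 evaluated directly from all nine input bits."""
    b = [bit(x, p) for p in range(9)]
    return ((b[0] & b[2]) ^ b[3] ^ (b[2] & b[5]) ^ (b[5] & b[6])
            ^ (b[0] & b[7]) ^ (b[1] & b[7]) ^ (b[2] & b[7])
            ^ (b[4] & b[8]) ^ (b[5] & b[8]) ^ (b[7] & b[8]) ^ 1)


def build_table(bits_used, partial):
    """Tabulate a partial function over every value of a subset of bits."""
    table = []
    for idx in range(2 ** len(bits_used)):
        b = {p: (idx >> k) & 1 for k, p in enumerate(bits_used)}
        table.append(partial(b))
    return table


def index(x, bits_used):
    """Gather the selected bits of x into a compact table index."""
    return sum(bit(x, p) << k for k, p in enumerate(bits_used))


G1 = (2, 3, 4, 5)        # subset named in the description for group 1
G3 = (0, 1, 2, 7)        # assumed subset for group 3
G6 = (4, 5, 6, 7, 8)     # assumed subset for group 6

T1 = build_table(G1, lambda b: (b[2] & b[5]) ^ b[3])
T3 = build_table(G3, lambda b: (b[0] & b[2]) ^ (b[0] & b[7])
                 ^ (b[1] & b[7]) ^ (b[2] & b[7]))
T6 = build_table(G6, lambda b: (b[4] & b[8]) ^ (b[5] & b[6])
                 ^ (b[5] & b[8]) ^ (b[7] & b[8]) ^ 1)


def y0_from_tables(x):
    """Step 1010 (three look-ups) followed by step 1020 (exclusive-OR)."""
    return T1[index(x, G1)] ^ T3[index(x, G3)] ^ T6[index(x, G6)]
```

Because exclusive-OR is commutative and associative, the three partial
results can be combined in any order, and the constant "1" needs to be
folded into only one of the tables.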

In the illustrative example the method of Figure 11
is applied to the S9 function. At step 1010, for each input X'
an output is generated for each of the look-up tables of groups
1 to 6 and the outputs are combined at step 1020. Further
details of steps 1010, 1020 of the method of Figure 11 will now
be described for a PowerPC processor having an Altivec
co-processor in which vperm instructions are used to look-up the
look-up tables.

5 The vperm instruction makes use of the least 4 or 5 

bits of an input; however, in the set of columns 12 30, for each 
group 1 to 6 the bits x' p that are to be used for looking-up a 
respsctive look-up table are not ordered as the 4 or 5 least 
significant bits with a left-most bit being a most significant
bit and a right-most bit being a least significant bit but
rather are scattered over the 9-bit input. For example, at
step 1010, for group 1 the bits x'_2, x'_3, x'_4, x'_5 are to be
used for looking up a respective look-up table; however, the bits
x'_2, x'_3, x'_4, x'_5 are not ordered as least significant bits of
the input X'. As such, in the illustrative example at step 1010 a
subset of bits of each input X' is selected by manipulation of
the bits x'_p so that the bits of the subset of bits are ordered
as least significant bits for indexing into one or two vectors.
In Figure 14, the bits x'_p are shown in a column 1310 for each
group 1 to 6. In a column 1320, at most eight of the nine bits
x'_p are shown for each group 1 to 6 being re-ordered for
indexing into one or two vectors. In particular, subsets of
bits 1330, 1331, 1332, 1333, 1334, 1335 for which the look-up
tables are looked up for each group 1 to 6 are shown in column
1320. For example, for group 1 the subset of bits 1330
contains bits x'_5, x'_4, x'_3, x'_2 being re-ordered as least
significant bits. The instructions used for re-ordering the
bits x'_p are listed for each group 1 to 6 in a column 1340. In
particular, in the illustrative example for group 1 a vsrb
(vector shift right byte) instruction is used to manipulate the
bits x'_p; for group 2 a vsel instruction is used to manipulate
the bits x'_p; for group 3 a vrlb (vector rotate left byte)
instruction is used to re-order the bits x'_p; for group 4 a vsel
instruction is used to manipulate the bits x'_p; for group 5 a
combination of vslb (vector shift left byte) and vsel
instructions is used to manipulate the bits x'_p; and for group 6
a combination of vsrb and vsel instructions is used to
manipulate the bits x'_p. In column 1320, although the subsets
of bits 1330, 1331, 1332, 1333, 1334, 1335 are ordered as least
significant bits, within each subset of bits there is no
specific ordering of bits required. This is because a look-up
table may be pre-determined for any ordering of the bits within
a subset of bits.
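
The remark that no specific ordering is required within a subset
can be illustrated with a minimal sketch. It assumes a
hypothetical 16-entry table and an arbitrary re-ordering of four
index bits; the names reorder_table and permute_index are
illustrative and do not appear in the specification. Re-ordering
the index bits only permutes the entries of the table, so a table
can be pre-computed for whichever ordering the bit manipulation
happens to produce.

```python
def permute_index(index, perm):
    """Return an index whose bit i is bit perm[i] of `index` (bit 0 = LSB)."""
    return sum(((index >> src) & 1) << i for i, src in enumerate(perm))


def reorder_table(table, perm):
    """Build the table to use when the index bits are re-ordered by `perm`."""
    k = len(perm)
    new_table = [0] * (1 << k)
    for new_index in range(1 << k):
        # Undo the re-ordering to find which original entry is meant.
        old_index = 0
        for i, src in enumerate(perm):
            old_index |= ((new_index >> i) & 1) << src
        new_table[new_index] = table[old_index]
    return new_table


# Hypothetical table contents; the values carry no meaning.
table = [(7 * i + 3) % 256 for i in range(16)]
perm = [2, 0, 3, 1]  # an arbitrary ordering of the four index bits
table2 = reorder_table(table, perm)

# Looking up the re-ordered table with the re-ordered index gives the
# same value as looking up the original table with the original index.
same = all(table2[permute_index(i, perm)] == table[i] for i in range(16))
```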

The manipulation of bits will now be described in
further detail with reference to Figures 15A to 15F. In
particular, a number of vector operations will be used to
manipulate the bits of each input X' in parallel. As discussed
above, for group 1 a vsrb instruction is used to re-order the
bits x'_p of each input X' in parallel. For example, as shown in
Figure 15A, for group 1 the vsrb instruction operates on a
vector 1404 containing 1-byte elements (only one 1-byte element
1402 is shown for clarity). Each element 1402 contains the bits
x'_7, x'_6, x'_5, x'_4, x'_3, x'_2, x'_1, x'_0 of a respective input
X'. In the elements 1402, the bits x'_7, x'_6, x'_5, x'_4, x'_3,
x'_2, x'_1, x'_0 are represented by their indexes 7, 6, 5, 4, 3, 2,
1, 0, respectively. For each input X', the vsrb instruction shifts
right the bits x'_7, x'_6, x'_5, x'_4, x'_3, x'_2, x'_1, x'_0 by two
bit units and outputs a vector 1406 containing 1-byte elements
(only one 1-byte element 1407 is shown for clarity). For the vsrb
instruction of Figure 15A, each element 1407 has the bits x'_7,
x'_6, x'_5, x'_4, x'_3, x'_2 represented by indexes 7, 6, 5, 4, 3, 2,
respectively, as least significant bits and the bits x'_1, x'_0 of
element 1402 represented by their indexes 1 and 0,
respectively, are lost leaving two free most significant bits
1408 and 1409 with a zero value represented by "0". In the
element 1407, the bits x'_5, x'_4, x'_3, x'_2 of the subset of bits
1330 are ordered as least significant bits.

In Figure 15B, for group 2 using a vsel operation the
vector 1406 which is output from the vsrb instruction for group
1 is used in combination with the bits x'_p of each input X' to
manipulate the bits x'_p. In particular, the vsel instruction
operates on the vectors vA_3 1410 and vB_3 1412 using a vector
vC_3 1414. The vector vA_3 1410 corresponds to the vector 1406 of
Figure 15A and the vector vB_3 1412 contains the bits x'_7, x'_6,
x'_5, x'_4, x'_3, x'_2, x'_1, x'_0 of each input X'. The vector vC_3
1414 has 16 1-byte elements (only one 1-byte element 1418 is
shown for clarity) each having a constant 00000011 in base-2
notation as an entry. Each entry of the element 1418 of vector
vC_3 1414 is used to select bits from the vectors vA_3 1410 and
vB_3 1412 resulting in a vector vD_3 1416 having 1-byte elements
(only one 1-byte element 1419 is shown for clarity). The element
1419 contains two "0" bits as most significant bits and contains
bits x'_7, x'_6, x'_5, x'_4, x'_1, x'_0 represented by indexes 7, 6,
5, 4, 1, 0, respectively, as least significant bits. In the
element 1419, the bits x'_5, x'_4, x'_1, x'_0 of the subset of bits
1331 are ordered as least significant bits for indexing into a
vector.
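
The bit selection for groups 1 and 2 can be sketched with a
scalar emulation of the two vector instructions. This is an
illustration only: the helper names and the 16 input bytes are
hypothetical, and each Python list stands in for one vector of 16
1-byte elements. The vsel helper follows the usual AltiVec
semantics, in which a 1 bit in the mask selects the bit of the
second operand.

```python
def vsrb(v, n):
    """Shift each 1-byte element right by n bits; zeros shift in."""
    return [(b >> n) & 0xFF for b in v]


def vsel(va, vb, vc):
    """Per bit: take the bit of vb where vc has a 1, else the bit of va."""
    return [(a & ~c & 0xFF) | (b & c) for a, b, c in zip(va, vb, vc)]


def bits(x, positions):
    """Pack the listed bits of x, first listed position most significant."""
    out = 0
    for p in positions:
        out = (out << 1) | ((x >> p) & 1)
    return out


# 16 hypothetical inputs; only the low byte x'_7..x'_0 matters here.
low = [(37 * i + 11) & 0xFF for i in range(16)]

g1 = vsrb(low, 2)                      # element layout: 0 0 7 6 5 4 3 2
g2 = vsel(g1, low, [0b00000011] * 16)  # element layout: 0 0 7 6 5 4 1 0

idx1 = [e & 0x0F for e in g1]          # x'_5 x'_4 x'_3 x'_2 (subset 1330)
idx2 = [e & 0x0F for e in g2]          # x'_5 x'_4 x'_1 x'_0 (subset 1331)
```

The four least significant bits of each element then serve as the
table index for the corresponding group.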

For group 3, a vrlb (vector rotate left byte)
instruction is used to re-order the bits x'_p of each input X'.
In Figure 15C, a vector 1422 has 16 1-byte elements (only one
1-byte element 1420 is shown for clarity). Each element 1420
contains the bits x'_7, x'_6, x'_5, x'_4, x'_3, x'_2, x'_1, x'_0
represented by 7, 6, 5, 4, 3, 2, 1, 0, respectively, of a
respective input X'. In each element 1420 the bits x'_7, x'_6,
x'_5, x'_4, x'_3, x'_2, x'_1, x'_0 are rotated left by two bit units
resulting in a vector 1424 having 1-byte elements (only one
1-byte element 1426 is shown for clarity) containing re-ordered
input bits x'_5, x'_4, x'_3, x'_2, x'_1, x'_0, x'_7, x'_6. In each
element 1426, the bits x'_2, x'_1, x'_0, x'_7, x'_6 of the subset of
bits 1332 are ordered as least significant bits.

In Figure 15D, for group 4 using a vsel operation the
vector 1424 which is output from the vrlb instruction for group
3 is used in combination with the bits x'_p of each input X' to
manipulate the bits x'_p. In particular, the vsel instruction
operates on vectors vA_4 1430 and vB_4 1432 using a vector vC_4
1434. The vector vB_4 1432 corresponds to the vector 1424 of
Figure 15C and the vector vA_4 1430 contains the bits x'_7, x'_6,
x'_5, x'_4, x'_3, x'_2, x'_1, x'_0 of each input X'. The vector vC_4
1434 has 16 1-byte elements (only one 1-byte element 1439 is
shown for clarity) each having a constant 00000011 in base-2
notation as an entry. Each entry of the element 1439 of vector
vC_4 1434 is used to select bits from the vectors vA_4 1430 and
vB_4 1432 resulting in a vector vD_4 1436 having 16 1-byte
elements (only one 1-byte element 1438 is shown for clarity).
Each element 1438 contains bits x'_7, x'_6, x'_5, x'_4, x'_3, x'_2,
x'_7, x'_6 represented by indexes 7, 6, 5, 4, 3, 2, 7, 6,
respectively, as re-ordered bits. In the element 1438, the bits
x'_4, x'_3, x'_2, x'_7, x'_6 of the subset of bits 1333 are ordered
as least significant bits.

For group 5, a combination of a vslb (vector shift
left byte) instruction and a vsel instruction is used to obtain
the subset of bits 1334. In Figure 15E, the vslb instruction
operates on a vector 1440 having 16 1-byte elements (only one
1-byte element 1444 is shown for clarity). Each element 1444
contains bits x'_7, x'_6, x'_5, x'_4, x'_3, x'_2, x'_1, x'_0 of a
respective input X' and the vslb instruction shifts left the bits
x'_7, x'_6, x'_5, x'_4, x'_3, x'_2, x'_1, x'_0 by one bit unit and
outputs a vector 1442. The vsel instruction then makes use of the
vector 1442. In particular, the vsel instruction operates on
vectors vA_5 1446 and vB_5 1448. The vector vA_5 1446 corresponds
to vector 1442 obtained from the vslb instruction and the vector
vB_5 1448 contains 16 1-byte elements (only one 1-byte element
1445 is shown for clarity). Each element 1445 contains the bit
x'_8 of a respective input X'. The vsel instruction operates on
the vectors vA_5 1446 and vB_5 1448 using a vector vC_5 1441 having
16 1-byte elements (only one 1-byte element 1449 is shown for
clarity). Each element 1449 has a constant 00000001 in base-2
notation as an entry to select bits from the vectors vA_5 1446
and vB_5 1448 resulting in a vector vD_5 1443 having a 1-byte
element 1447 for each input X' (only one element 1447 is shown
for clarity). The element 1447 contains bits x'_6, x'_5, x'_4,
x'_3, x'_2, x'_1, x'_0, x'_8 represented by indexes 6, 5, 4, 3, 2,
1, 0, 8, respectively, as re-ordered bits. In the element 1447,
the bits x'_3, x'_2, x'_1, x'_0, x'_8 of the subset of bits 1334 are
ordered as least significant bits.

For group 6, a combination of a vsrb instruction and
a vsel instruction is used to obtain the subset of bits 1335.
In Figure 15F, the vsrb instruction operates on a vector 1450
having 16 1-byte elements (only one 1-byte element 1453 is
shown for clarity). Each element 1453 contains bits x'_7, x'_6,
x'_5, x'_4, x'_3, x'_2, x'_1, x'_0 of a respective input X' and the
vsrb instruction shifts right the bits x'_7, x'_6, x'_5, x'_4,
x'_3, x'_2, x'_1, x'_0 by three bit units and outputs a vector 1452.
The vsel instruction then makes use of the vector 1452. In
particular, the vsel instruction operates on vectors vA_6 1454
and vB_6 1456. The vector vA_6 1454 corresponds to vector 1452
obtained from the vsrb instruction and the vector vB_6 1456
contains 16 1-byte elements (only one 1-byte element 1457 is
shown for clarity). Each element 1457 contains the bit x'_8 of a
respective input X'. The vsel instruction operates on the
vectors vA_6 1454 and vB_6 1456 using a vector vC_6 1456 having 16
1-byte elements (only one 1-byte element 1549 is shown for
clarity). Each element 1549 has a constant 00000001 in base-2
notation as an entry used to select bits from the vectors vA_6
1454 and vB_6 1456 resulting in a vector vD_6 1451 having a 1-byte
element 1455 for each input X' (only one 1-byte element 1455 is
shown for clarity). The element 1455 contains three null bits as
most significant bits and contains bits x'_7, x'_6, x'_5, x'_4, x'_8
represented by indexes 7, 6, 5, 4, 8, respectively, as least
significant re-ordered bits.
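
The manipulations for groups 3 to 6 can be sketched in the same
way. The emulation below is a sketch under stated assumptions:
the helper names and the 16 nine-bit inputs are hypothetical, the
low eight bits of each input are held in one list and the bit
x'_8 in the least significant bit of a second list, and vsel is
assumed to take the bit of its second operand wherever the mask
bit is 1.

```python
def vslb(v, n):
    """Shift each 1-byte element left by n bits; zeros shift in."""
    return [(b << n) & 0xFF for b in v]


def vsrb(v, n):
    """Shift each 1-byte element right by n bits; zeros shift in."""
    return [(b >> n) & 0xFF for b in v]


def vrlb(v, n):
    """Rotate each 1-byte element left by n bits (0 < n < 8)."""
    return [((b << n) | (b >> (8 - n))) & 0xFF for b in v]


def vsel(va, vb, vc):
    """Per bit: take the bit of vb where vc has a 1, else the bit of va."""
    return [(a & ~c & 0xFF) | (b & c) for a, b, c in zip(va, vb, vc)]


def bits(x, positions):
    """Pack the listed bits of x, first listed position most significant."""
    out = 0
    for p in positions:
        out = (out << 1) | ((x >> p) & 1)
    return out


inputs = [(53 * i + 29) & 0x1FF for i in range(16)]  # hypothetical X'
low = [x & 0xFF for x in inputs]                     # x'_7 .. x'_0
bit8 = [x >> 8 for x in inputs]                      # x'_8 in the LSB

g3 = vrlb(low, 2)                                    # 5 4 3 2 1 0 7 6
g4 = vsel(low, g3, [0b00000011] * 16)                # 7 6 5 4 3 2 7 6
g5 = vsel(vslb(low, 1), bit8, [0b00000001] * 16)     # 6 5 4 3 2 1 0 8
g6 = vsel(vsrb(low, 3), bit8, [0b00000001] * 16)     # 0 0 0 7 6 5 4 8

# The five least significant bits of each element form the table index.
idx = {g: [e & 0x1F for e in v] for g, v in ((3, g3), (4, g4), (5, g5), (6, g6))}
```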

Step 1010 of Figure 11 will now be described for
group 1 of the illustrative example in which a vperm
instruction is used for looking up a look-up table. For group
1, referring back to Figures 13 and 14, columns 1240 and 1320
indicate that for each input X' four of the bits x'_p forming the
subset of bits 1330 are used to look up a look-up table. As
such, as indicated in a column 1250, for group 1 the vperm
instruction operates on one vector having 16 1-byte elements.
Similarly, for group 2 for each input X' there are 4 of the
bits x'_p used for looking up a look-up table and the vperm
instruction operates on one vector having 16 1-byte elements as
indicated in column 1250. For groups 3 to 6, for each input X'
there are 5 of the bits x'_p used for looking up look-up tables
and the vperm instruction operates on two vectors each having
16 1-byte elements as indicated in column 1250.

The vperm instruction will now be described with
reference to Figure 16 for a look-up for group 1 as an example.
For group 1, the vperm instruction operates on a vector vA_7 1510
using a vector vC_7 1530. The vector vA_7 1510 contains 16 1-byte
elements (only 7 elements 1515 are shown for clarity) each
containing an element of the look-up table for group 1. The
vector vC_7 1530 contains 16 1-byte elements (only 7 elements
1535 are shown for clarity) each containing the re-ordered bits
x'_7, x'_6, x'_5, x'_4, x'_3, x'_2 (not shown) of a respective input
X' as indicated in column 1320 of Figure 14. The vperm instruction
makes use of the subset of bits 1330 corresponding to the 4
least significant bits x'_5, x'_4, x'_3, x'_2 to select one of the
elements 1515 to be output as an element 1545 (only 7 elements
1545 are shown for clarity) of a vector vD_7 1540. Each element
1545 of the vector vD_7 1540 contains a 1-byte output for bits
y'_8, y'_6, y'_5, y'_4, y'_3, y'_2, y'_1, y'_0 as shown in the set of
columns 1210 of Figure 13.
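
The look-up itself can be sketched with a scalar emulation of
vperm. The sketch relies on the documented AltiVec behaviour
that each control byte selects, with its five least significant
bits, one of the 32 bytes formed by the two table operands.
Passing the same 16-entry table as both operands, so that only
four index bits matter, is an assumption made here for the
one-vector case; the table contents and control bytes are
hypothetical.

```python
def vperm(va, vb, vc):
    """Each result byte is element (c & 0x1F) of the 32-byte list va + vb."""
    both = va + vb
    return [both[c & 0x1F] for c in vc]


# Hypothetical table contents; real tables would hold partial S9 results.
table16 = [(13 * i + 5) & 0xFF for i in range(16)]   # groups 1 and 2
table32 = [(29 * i + 7) & 0xFF for i in range(32)]   # groups 3 to 6

# 16 hypothetical control bytes, one per input X', as produced by the
# bit manipulation step (index bits in the least significant positions).
ctrl = [(37 * i + 11) & 0x3F for i in range(16)]

# One-vector look-up: 4 index bits; the fifth bit is made irrelevant
# by supplying the same table for both operands (an assumption).
out4 = vperm(table16, table16, ctrl)

# Two-vector look-up: 5 index bits select among 32 table entries.
out5 = vperm(table32[:16], table32[16:], ctrl)
```

All 16 look-ups are expressed as a single call, which mirrors how
one vector instruction serves 16 inputs at once.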

For group 2, with reference to columns 1240, 1250 of
Figure 13 the vperm instruction makes use of four bits as
indexes into one vector corresponding to vector vA_7 1510
containing elements of the look-up table for group 2. The four
bits correspond to x'_5, x'_4, x'_1, x'_0 as shown by the subset of
bits 1331 in column 1320 of table 1300. Each element 1545 of
the vector vD_7 1540 output by the vperm instruction contains a
1-byte output for bits y'_8, y'_6, y'_5, y'_4, y'_3, y'_2, y'_1, y'_0
as shown in the set of columns 1210 of Figure 13.

For group 3, as shown in columns 1240, 1250 of Figure
13 the vperm instruction makes use of five bits as indexes into
two vectors corresponding to vector vA_7 1510 and another vector
vB_7 1520. Vectors vA_7 1510 and vB_7 1520 contain elements of the
look-up table for group 3. The five bits correspond to x'_2, x'_1,
x'_0, x'_7, x'_6 as shown by the subset of bits 1332 in column 1320
of table 1300. Each element 1545 of the vector vD_7 1540 output
by the vperm instruction contains a 1-byte output for bits y'_8,
y'_6, y'_5, y'_4, y'_3, y'_2, y'_1, y'_0 as shown in the set of
columns 1210 of Figure 13.

For group 4, as shown in columns 1240, 1250 of Figure
13 the vperm instruction makes use of five bits as indexes into
the two vectors vA_7 1510 and vB_7 1520. In this case vectors vA_7
1510 and vB_7 1520 contain elements of the look-up table for
group 4. The five bits correspond to x'_4, x'_3, x'_2, x'_7, x'_6 as
shown by the subset of bits 1333 in column 1320 of table 1300.
Each element 1545 of the vector vD_7 1540 output by the vperm
instruction contains a 1-byte output for bits y'_8, y'_7, y'_6,
y'_5, y'_4, y'_3, y'_2, y'_1 as shown in the set of columns 1210 of
Figure 13.

For group 5, as shown in columns 1240, 1250 of Figure
13 the vperm instruction makes use of five bits to look up the
two vectors vA_7 1510 and vB_7 1520 in which the look-up table for
group 5 is loaded. The five bits correspond to x'_3, x'_2, x'_1,
x'_0, x'_8 as shown by the subset of bits 1334 in column 1320 of
table 1300. Each element 1545 of the vector vD_7 1540 output by
the vperm instruction contains a 1-byte output for bits y'_8,
y'_7, y'_6, y'_5, y'_4, y'_3, y'_2, y'_1 as shown in the set of
columns 1210 of Figure 13.

For group 6, as shown in columns 1240, 1250 of Figure
13 the vperm instruction makes use of five bits to look up the
two vectors vA_7 1510 and vB_7 1520 in which the look-up table for
group 6 is loaded. The five bits correspond to x'_7, x'_6, x'_5,
x'_4, x'_8 as shown by the subset of bits 1335 in column 1320 of
table 1300. Each element 1545 of the vector vD_7 1540 output by
the vperm instruction contains a 1-byte output for bits y'_7,
y'_6, y'_5, y'_4, y'_3, y'_2, y'_1, y'_0 as shown in the set of
columns 1210 of Figure 13.

In some embodiments of the invention, for each input
X' two or more of the outputs obtained from the look-up tables
form sets of first outputs. For each input X', each set of
first outputs has at least two of the outputs obtained from the
look-up tables for the input X'. Referring back to Figure 11,
step 1020 will now be described with reference to Figure 17 for
embodiments in which outputs from step 1010 form such sets of
first outputs. At step 1610, for an input X' for each set of
first outputs, the first outputs are combined into a second
output, and at step 1620 the second outputs are combined by
manipulating bits of at least one of the second outputs to
produce an overall output.

The method of Figure 17 will now be applied for the
illustrative example in which outputs are obtained using vperm
instructions. As shown in the set of columns 1210 of table
1200 for each group 1 to 6 there are eight output bits being
generated for determination of the nine bits y'_p. In
particular, outputs from groups 1 to 3 all have bits generated
for determination of output bits y'_8, y'_6, y'_5, y'_4, y'_3, y'_2,
y'_1, y'_0 and form a set of first outputs 1260. Similarly, outputs
from groups 4 and 5 all have bits generated for determination
of output bits y'_8, y'_7, y'_6, y'_5, y'_4, y'_3, y'_2, y'_1 and form
another set of first outputs 1270. At step 1610, the first outputs
1260 are combined using exclusive-OR operations and the first
outputs 1270 are also combined using exclusive-OR operations.
In particular, in the illustrative example the exclusive-OR
operations are applied using an Altivec vxor (vector exclusive-
OR) instruction.

The steps of the method of Figure 17 will now be
described with reference to Figure 18, which is a flow diagram
showing how vectors containing outputs are combined by being
operated on using exclusive-OR and bit manipulation operations.
In particular, the flow diagram of Figure 18 is used to
illustrate the method steps of Figure 17 in which for an input
X' for each set of first outputs, the first outputs are
combined into a second output, and the second outputs are then
combined by manipulating bits of at least one of the second
outputs.

In Figure 18, a vector 1611 has a 1-byte element 1615
for each input X' (only one element 1615 is shown for clarity)
with the 1-byte element 1615 containing bits from the first
output 1260 of group 1. The bits from the first output 1260 of
group 1 are identified as 6, 5, 4, 3, 2, 1, 0, 8 in element
1615 and are used for determination of bits y'_6, y'_5, y'_4, y'_3,
y'_2, y'_1, y'_0, y'_8, respectively. A vector 1620 has a 1-byte
element 1625 for each input X' (only one element 1625 is shown for
clarity) with the 1-byte element 1625 containing bits from the
first output 1260 of group 2. The bits from the first output
1260 of group 2 are identified as 6, 5, 4, 3, 2, 1, 0, 8 in
element 1625 and are used for determination of bits y'_6, y'_5,
y'_4, y'_3, y'_2, y'_1, y'_0, y'_8, respectively. A vector 1630 has a
1-byte element 1635 for each input X' (only one element 1635 is
shown for clarity) with the 1-byte element 1635 containing bits
from the first output 1260 of group 3. The bits from the
first output 1260 of group 3 are identified as 6, 5, 4, 3, 2,
1, 0, 8 in element 1635 and are used for determination of bits
y'_6, y'_5, y'_4, y'_3, y'_2, y'_1, y'_0, y'_8, respectively.

For the set of first outputs 1270, a vector 1640 has
a 1-byte element 1645 for each input X' (only one element 1645
is shown for clarity) with the 1-byte element 1645 containing
bits from the first output 1270 of group 4. The bits from the
first output 1270 of group 4 are identified as 7, 6, 5, 4, 3,
2, 1, 8 in element 1645 and are used for determination of bits
y'_7, y'_6, y'_5, y'_4, y'_3, y'_2, y'_1, y'_8, respectively. A vector
1650 has a 1-byte element 1655 for each input X' (only one element
1655 is shown for clarity) with the 1-byte element 1655
containing bits from the first output 1270 of group 5. The bits
from the first output 1270 of group 5 are identified as 7, 6,
5, 4, 3, 2, 1, 8 in element 1655 and are used for determination
of bits y'_7, y'_6, y'_5, y'_4, y'_3, y'_2, y'_1, y'_8, respectively.

A vector 1654 has a 1-byte element 1664 for each
input X' which is obtained from a combination of vectors 1611,
1620, 1630, 1640, 1650 using exclusive-OR operations 1901,
1902, 1903, 1904. In particular, the element 1664 has a bit
1666 that corresponds to a result for bit y'_8 and seven bits
1667 having entries "A" which in this case are not used.

A vector 1632 has a 1-byte element 1636 for each
input X' (only one element 1636 is shown for clarity) with a
most significant bit 1637 having a zero value represented by
"0". The vector 1632 is obtained from a combination of
vectors 1611, 1620, 1630 using exclusive-OR operations 1901,
1902 and from a vsrb operation 1906.

A vector 1652 has a 1-byte element 1653 for each
input X' (only one element 1653 is shown for clarity) with a
bit 1658 having a zero value represented by "0". The vector
1652 is obtained from vectors 1640 and 1650 using an exclusive-
OR operation 1903 and using an Altivec vandc (vector and
complement) operation 1907.

A vector 1675 has an element 1670 for each input X'
(only one element 1670 is shown for clarity). Bits within the
element 1670 are identified by indexes 7, 6, 5, 4, 3, 2, 1, 0
and are used for determination of bits y'_7, y'_6, y'_5, y'_4, y'_3,
y'_2, y'_1, y'_0, respectively. The vector 1675 is obtained from
vectors 1632, 1652 using an exclusive-OR operation 1905.

A vector 1660 has a 1-byte element 1680 for each
input X'. Each element 1680 contains a first output 1280 shown
in Figure 13 for group 6. Bits within the element 1680 are
identified by indexes 7, 6, 5, 4, 3, 2, 1, 0 and are used for
determination of bits y'_7, y'_6, y'_5, y'_4, y'_3, y'_2, y'_1, y'_0,
respectively.

In Figure 18, in combining the first outputs 1260 of
groups 1 to 3 a first vxor instruction operates on the vectors
1611, 1620, in which corresponding bits of the vectors 1611,
1620 undergo exclusive-OR operation 1901 and results are output
into the vector 1620. A second vxor instruction then operates
on the vectors 1620, 1630 and corresponding bits of the vectors
1620, 1630 undergo exclusive-OR operation 1902. Results from
the second vxor instruction are output as part of vector 1630
as a second output. For the first outputs 1270 of groups 4 and
5, at step 1610 a third vxor instruction operates on vectors
1640, 1650, in which corresponding bits of the vectors 1640,
1650 undergo exclusive-OR operation 1903 and results are output
into the vector 1650 as a second output.

A fourth vxor instruction operates on the vectors 1630,
1650 containing the second outputs, and bits within the vectors
1630, 1650 undergo exclusive-OR operation 1904, the result of
which is output as vector 1654. In particular, the bit 1666 of
vector 1654 corresponds to a result for bit y'_8.

To obtain results for the bits y'_7, y'_6, y'_5, y'_4, y'_3,
y'_2, y'_1, y'_0, the bits of elements 1635 and 1655 of vectors 1630
and 1650, respectively, are first manipulated. For example,
the vsrb instruction 1906 is used to shift right by one bit
unit bits of the element 1635 of each input X' of vector 1630
resulting in vector 1632. For the vector 1650, the bit 1656
of the element 1655 of each input X' is given a zero value, for
example by operating on the vector 1650 using the Altivec vandc
instruction 1907, resulting in vector 1652. A fifth vxor
instruction is then used to combine vectors 1632, 1652 in which
bits within the vectors 1632, 1652 undergo the exclusive-OR
operation 1905 to obtain vector 1675. Finally, a sixth vxor
instruction operates on the vectors 1675, 1660 and bits within
the vectors 1675, 1660 undergo the exclusive-OR operation 1908,
the result of which is output as vector 1660. In particular,
after the sixth vxor instruction each element 1680 has bits
identified by indexes 7, 6, 5, 4, 3, 2, 1, 0 that correspond to
results for bits y'_7, y'_6, y'_5, y'_4, y'_3, y'_2, y'_1, y'_0,
respectively.
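
The combining flow of Figure 18 can be sketched as follows. The
partial results are hypothetical 9-bit values (real ones would
come from the look-up tables), masked so that each group only
contributes the output bits the text attributes to it, and the
pack_* helpers are illustrative names for the three element
layouts described above. The sketch checks that the XOR, shift
and and-with-complement steps reassemble the plain XOR of the six
partial results.

```python
def xor_vec(a, b):
    """Element-wise exclusive-OR of two vectors (emulates vxor)."""
    return [x ^ y for x, y in zip(a, b)]


N = 16
# Output bits each group contributes (bit 8 is the most significant).
MASKS = {1: 0x17F, 2: 0x17F, 3: 0x17F, 4: 0x1FE, 5: 0x1FE, 6: 0x0FF}
# Hypothetical partial results, one 9-bit value per group and input.
partial = {g: [((17 * i + 101 * g) * 73) & m for i in range(N)]
           for g, m in MASKS.items()}


def pack_a(p):   # groups 1 to 3, element layout 6 5 4 3 2 1 0 8
    return ((p & 0x7F) << 1) | (p >> 8)


def pack_b(p):   # groups 4 and 5, element layout 7 6 5 4 3 2 1 8
    return (p & 0xFE) | (p >> 8)


def pack_c(p):   # group 6, element layout 7 6 5 4 3 2 1 0
    return p & 0xFF


v = {g: [pack_a(p) for p in partial[g]] for g in (1, 2, 3)}
v.update({g: [pack_b(p) for p in partial[g]] for g in (4, 5)})
v[6] = [pack_c(p) for p in partial[6]]

v1630 = xor_vec(xor_vec(v[1], v[2]), v[3])   # operations 1901, 1902
v1650 = xor_vec(v[4], v[5])                  # operation 1903
v1654 = xor_vec(v1630, v1650)                # operation 1904, LSB is y'_8
v1632 = [b >> 1 for b in v1630]              # vsrb 1906: 0 6 5 4 3 2 1 0
v1652 = [b & ~0x01 & 0xFF for b in v1650]    # vandc 1907: clear the y'_8 bit
v1675 = xor_vec(v1632, v1652)                # operation 1905
low = xor_vec(v1675, v[6])                   # operation 1908: y'_7 .. y'_0

y = [((h & 1) << 8) | l for h, l in zip(v1654, low)]
expected = [partial[1][i] ^ partial[2][i] ^ partial[3][i]
            ^ partial[4][i] ^ partial[5][i] ^ partial[6][i] for i in range(N)]
```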

In the illustrative example at step 1010, 8
instructions are used for selecting the subsets of bits 1330,
1331, 1332, 1333, 1334, 1335 and 6 vperm instructions are used
in looking up tables for groups 1 to 6. At step 1020, 8
instructions are used to obtain results for the bits y'_8, y'_7,
y'_6, y'_5, y'_4, y'_3, y'_2, y'_1, y'_0. Furthermore, in the
illustrative example, steps 1010 and 1020 are performed in
parallel for 16 inputs X'. As such, a total of 22 instructions are
used to obtain 16 outputs Y' resulting in an average of
approximately 1.4 instructions for each output Y'. Furthermore, in
column 1250 of table 1200 there is a total of 10 vectors into
which the look-up tables of groups 1 to 6 are loaded, taking up
only 10 of the 32 vectors available on a PowerPC having an Altivec
co-processor. As such, the look-up tables of groups 1 to 6 provide
a packing that not only allows the look-up tables for the S9
function (the look-up tables of groups 1 to 6) to be loaded
together into the vectors but also leaves vectors available for
loading the look-up table for the S7 function into the vectors.
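
The counts quoted above can be tallied as a quick check. The
per-group figures are read from the description of Figures 13 to
18 (one or two bit-selection instructions per group, one vperm
per group, and six vxor plus one vsrb and one vandc for
combining); the variable names are illustrative.

```python
# Bit-selection instructions per group: vsrb, vsel, vrlb, vsel,
# vslb + vsel, vsrb + vsel.
select_ops = {1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2}
lookup_ops = 6                 # one vperm per group
combine_ops = 6 + 1 + 1        # six vxor, one vsrb, one vandc

total_ops = sum(select_ops.values()) + lookup_ops + combine_ops
per_output = total_ops / 16    # 16 inputs are processed in parallel

# Vectors holding the look-up tables (column 1250): one for each
# 4-bit table, two for each 5-bit table.
table_vectors = {1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 2}
used = sum(table_vectors.values())
free = 32 - used               # vector registers left for other data
```

The average works out to 22 / 16 = 1.375, which the text rounds
to 1.4.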

The illustrative example shows how the steps 1010,
1020 of Figure 11 can be performed to produce outputs in a
reduced number of instructions to provide a low demand on
computing resources; however, the invention is not limited to
performing the method steps 1010, 1020 of Figure 11 as
described by the illustrative example. For example, in the
illustrative example as shown in Figure 12 there are a total of
six groups corresponding to groups 1 to 6 for which six look-up
tables are looked up at step 1010. In other embodiments of the
invention, there are more or fewer groups resulting in more or
fewer look-up tables being looked up. In addition, as shown in
column 1240, for each group 1 to 6 there are 4 or 5 of the bits
x'_p being used to look up each table; however, this is a
limitation of the vperm instruction only and in other
embodiments of the invention, other instructions may be used
for looking up look-up tables which require more or fewer than 4
or 5 of the bits x'_p being used to look up each look-up table.
For each group 1 to 6, the pre-determined value of the look-up
table is obtained by way of a partial evaluation of the
S9 function and is a function of a number being definable by a
bit sequence of one of 4 and 5 bits. However, this is a
limitation of the Altivec vperm instruction only, and in other
embodiments of the invention each pre-determined value is a
function of a number being definable by a bit sequence other than
4 and 5 bits. In the illustrative example, in looking up the
look-up tables the outputs from the vperm instruction have 8 bits
corresponding to fewer than the 9 bits y'_p; however, embodiments
of the invention are not limited to the outputs from the look-up
tables having fewer bits than the number of bits y'_p.
For example, the method of Figure 11 is equally applicable to
the S7 function, in which case the vperm instruction is capable
of outputting bits for all 7 bits y_j. In addition, while some
embodiments of the invention are limited to combining outputs
to obtain the 9 bits y'_p, in other embodiments of the invention,
outputs are combined to obtain at least one bit.

In the illustrative example, the method of Figure 11
is applied to the S9 function and the look-up tables have pre-
determined values obtained from a partial evaluation of the S9
function. Furthermore, as described with reference to Figure
18, the outputs obtained from the look-up tables are combined
using exclusive-OR operations. Embodiments of the invention
are not limited to the evaluation of the S9 function and other
functions may be used. Furthermore, in some embodiments of the
invention in which other functions are used, outputs obtained
from the look-up tables are combined using other operations
such as addition and multiplication for example.

Regarding the set of columns 1230, specific subsets
of bits of the bits x'_p are selected for each group 1 to 6 and
in other embodiments of the invention other subsets of bits
are used for looking up tables as long as each of the bits x'_p
is used to look up at least one look-up table. Regarding
column 1220, the number of bits generated for each of groups 1 to
6 is between 5 and 8 and in other embodiments in which the
evaluation of the S9 function is performed on a PowerPC
processor having an Altivec co-processor, the number of bits
being generated for each group defined is 8 or less; however,
this limitation is imposed only by the architecture on which
the method is implemented and in other embodiments of the
invention, a maximum number of bits that can be generated
depends on the architecture on which the method of Figure 11 is
applied. Furthermore, for each group 1 to 6 the set of columns
1210 shows specific sequences of output bits being generated
and in other embodiments of the invention for each group
defined there are other sequences of output bits. In the
illustrative example in combining outputs, output bits are re-
ordered; however, in some embodiments of the invention there is
no re-ordering of output bits.

With reference to Figure 14, column 1320 shows re-
ordered bits for each of groups 1 to 6; however, the invention
is not limited to re-ordering bits for each group defined and
in other embodiments of the invention, the bits x'_p are re-
ordered for at least one of the groups defined. The particular
method of re-ordering the bits using vsrb, vsel, vrlb, and vslb
instructions is only one example. It is to be understood that
given a set of input bits, a subset of the bits in a desired
order can be generated using any suitable technique, as would
be understood by one skilled in the art.

Referring to Figure 19A, shown is a block diagram of
an apparatus 1805 for implementing the methods of Figures 5 and
11. The apparatus 1805 has a memory 1810 and a processor 1820
having a SIMD architecture capable of accessing information
stored in the memory 1810. The processor 1820 receives a
plurality of inputs 1840, and performs parallel processing using
the inputs 1840 to produce outputs 1830. In particular, memory
1810 stores a plurality of elements of each of a plurality of
look-up tables.

In implementing the method of Figure 5, each input
1840 is defined by a first set of bits and a second set of at
least one bit. For each input 1840, the processor 1820 looks up in
the memory 1810 one element of each look-up table, for which
elements are stored for the purpose of the method of Figure 5,
using the first set of bits that define the input. The look-
ups result in outputs. The processor 1820 selects one of the
outputs using the second set of at least one bit that define
the input 1840. Processing by the processor 1820 is performed
in parallel for each input 1840 resulting in outputs 1830.

In implementing the method of Figure 11, each input
1840 is defined by a plurality of bits. For each input 1840,
the processor 1820 selects a subset of bits of the plurality of
bits that define the input 1840, with the subset of bits having
fewer bits than the input. The processor 1820
looks up in the memory 1810 one element from each look-up
table, for which elements are stored for the purpose of the
method of Figure 11, using the subset of bits. The look-
ups result in outputs and the processor 1820 then combines the
outputs. Processing by the processor 1820 is performed in
parallel for sets of inputs 1840 resulting in outputs 1830.

Referring to Figure 19B, shown is a block diagram of
the apparatus 1805 of Figure 19A implemented as a ciphering
block 1800. The ciphering block 1800 contains the apparatus
1805 and operates on input data 1850. The apparatus 1805
implements the Kasumi ciphering algorithm that produces a 64-
bit output 131 from a 64-bit input 111 under the control of a
128-bit key 121. The input data 1850 undergo exclusive-OR
operations in parallel using the output 131 from the processor
1820 resulting in ciphered data 1870. For each input 111 and
key 121 and in parallel with other inputs 111 and keys 121 (not
shown), the processor 1820 implements the Kasumi algorithm in
which there are eight rounds of computations. At each of the
eight rounds the processor 1820 implements the methods of Figures
5 and 11 to evaluate the S7 and S9 functions, respectively.

In some embodiments of the invention the ciphering
apparatus is implemented at any device requiring ciphering such
as an RNC (Radio Network Controller) for example.

Another example implementation is illustrated in
Figure 20. There are N K_in-bit inputs 2000 to be processed,
wherein N and K_in are integers satisfying N, K_in > 2. Bit
permutation/reordering occurs at 2002 to produce M parallel
sets of outputs 2004, 2006 (only two shown). The ith set of
outputs contains N sets of bits, each L_i,in bits in length, and
defines a respective subset of the input bits to be used in
performing a table look-up. L_i,in is an integer satisfying
1 < L_i,in < K_in. Thus, the first parallel set 2004 contains
L_1,in bits for each input, and the last parallel set 2006
contains L_M,in bits for each input. For each parallel set of
output bits 2004, 2006, a parallel look-up table operation
2008, 2010 is performed to generate a corresponding parallel set
of outputs 2012, 2014. The ith set of parallel outputs contains
N outputs, one associated with each of the N inputs 2000, each
of which is L_i,out bits in length, wherein L_i,out is an integer
satisfying L_i,out > 1. Thus, the first output set 2012 contains
N outputs each L_1,out bits in length, and the last output set
2014 contains N outputs each L_M,out bits in length. Finally, for
each of the N inputs, a respective output is generated by
performing a bit combining and, in some cases, bit manipulation
operation on the outputs of the parallel look-up table
operations 2008, 2010 associated with that input. The combining
operations are collectively indicated generally at 2016 and are
preferably implemented in parallel. This produces outputs 2018,
which include a first K_out-bit output 2020 through an Nth
K_out-bit output 2022, wherein K_out is an integer satisfying
K_out ≥ 1.
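
The flow of Figure 20 can be sketched as follows, under stated assumptions. The sketch uses M = 2 index sets of 4 and 5 bits taken from 9-bit inputs, and a hypothetical target function that is linear over GF(2), so that the results of the two look-ups combine by exclusive-OR. This target function, the tables and all names are illustrative only; they are not the S9 function or the tables of Figures 12-18.

```python
import random

K_IN = 9        # bits per input
N = 16          # number of inputs handled together
LOW_BITS = 4    # L_1,in: bits routed to the first look-up
HIGH_BITS = 5   # L_2,in: bits routed to the second look-up

# Hypothetical GF(2)-linear target function: the output is the XOR of one
# 9-bit constant per set input bit. This is NOT the Kasumi S9 function.
rng = random.Random(0)
COLUMNS = [rng.randrange(1 << K_IN) for _ in range(K_IN)]


def reference(x):
    """Evaluate the target function directly, bit by bit."""
    out = 0
    for bit in range(K_IN):
        if (x >> bit) & 1:
            out ^= COLUMNS[bit]
    return out


# One small table per subset of input bits.
TABLE_LOW = [reference(a) for a in range(1 << LOW_BITS)]
TABLE_HIGH = [reference(b << LOW_BITS) for b in range(1 << HIGH_BITS)]


def parallel_lookup(inputs):
    # 2002: bit reordering splits every input into M = 2 index sets.
    low_idx = [x & ((1 << LOW_BITS) - 1) for x in inputs]
    high_idx = [x >> LOW_BITS for x in inputs]
    # 2008, 2010: one table look-up per index set, across all N inputs.
    low_out = [TABLE_LOW[i] for i in low_idx]
    high_out = [TABLE_HIGH[i] for i in high_idx]
    # 2016: combine the look-up results that belong to each input.
    return [a ^ b for a, b in zip(low_out, high_out)]


inputs = [rng.randrange(1 << K_IN) for _ in range(N)]
outputs = parallel_lookup(inputs)
```

Each list comprehension stands in for a single vector operation acting on all N inputs at once. The two small tables hold 16 and 32 entries, in place of one 512-entry table indexed by the full 9-bit input.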

In preferred embodiments, the sets of bits produced
by the bit permutation/reordering 2002 are selected such that
each set of bits affects only some respective defined maximum
number P_i < K of bits in the outputs. In this manner, each
parallel look-up table operation can be implemented using a
vector operation which operates in parallel on N inputs to
select N P_i-bit outputs, wherein P_i is an integer. If a vector
operation is available which is capable of looking up K-bit
values, this constraint on the bit permutation/reordering 2002
would not be necessary.
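
One simple way to stay within a P-bit look-up width is sketched below, assuming an 8-bit vector look-up and a hypothetical 32-entry table of 9-bit values. Each entry is split into a low 8-bit piece and a high 1-bit piece, both pieces are looked up with the same index, and the pieces are recombined. This illustrates the width constraint only and is not necessarily the decomposition used in the embodiments; the table contents and names are made up for the illustration.

```python
K = 9   # bits in each full table entry
P = 8   # assumed width of one vector look-up result

# Hypothetical 32-entry table of 9-bit values (not taken from the patent).
FULL_TABLE = [(37 * i + 5) % (1 << K) for i in range(32)]

# Split every entry into pieces that each fit within P bits.
TABLE_LO = [v & ((1 << P) - 1) for v in FULL_TABLE]  # low 8 bits
TABLE_HI = [v >> P for v in FULL_TABLE]              # remaining top bit


def lookup_wide(indices):
    """Look up 9-bit entries using two look-ups of at most 8 bits each."""
    lo = [TABLE_LO[i] for i in indices]   # first narrow look-up
    hi = [TABLE_HI[i] for i in indices]   # second narrow look-up
    # Bit combining: shift the high piece into place and merge.
    return [(h << P) | l for h, l in zip(hi, lo)]


indices = list(range(32))
results = lookup_wide(indices)
```

If a look-up returning K-bit values were available, the split into TABLE_LO and TABLE_HI would be unnecessary, which mirrors the remark above about the constraint on the bit permutation/reordering 2002.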

The example described previously with reference to
Figures 12-18 is a very specific example of the implementation
of Figure 20 in which there were N = 16 inputs. Different
numbers of inputs can be employed. In the example, each
overall input was K_in = 9 bits in length. Other lengths can be
employed. In the example, there were 9-bit outputs. Other
lengths can be produced. In the example, there were 6 sets of
parallel outputs, each of which was either 4 bits or 5 bits in
length, and 6 table look-up operations. Other numbers of
outputs/table look-up operations can be used and these can have
any suitable bit lengths. In the example, each output of the
parallel look-up operation was 8 bits in length. Other lengths
can be used.

Numerous modifications and variations of the present 
invention are possible in light of the above teachings. It is 
therefore to be understood that within the scope of the 
appended claims, the invention may be practised otherwise than
as specifically described herein. 